THE ADEQUACY OF THE SHORTENED, SINGLE-LIST
VOCABULARY TEST OF THE BINET-SIMON
TESTS (TERMAN REVISION)¹


LILLIE BURLING PEATMAN 
Consulting Psychologist 
AND 
JOHN GRAY PEATMAN 
College of the City of New York 


As a rule a complete test is administered when a subject is brought 
to a psychologist for a Binet (Terman Revision) mental test. By a 
complete test is meant that a basal age is established at which every 
test at that age level is passed—with a scattering of successes above 
this level and, finally, an upper limit at which the subject is unable, 
ordinarily, to pass any of the tests. There are times, however, when it 
is expedient to shorten the test period as far as possible and still 
retain a reasonably reliable measure. For this reason the Abbreviated 
Scale is sometimes used in which certain tests at each age level are 
omitted, in accordance with the recommendations in the official 
Record Booklet of the tests. For example, omitted from the Abbrevi- 
ated Scale at Year VIII are the “ball and field” test and “definitions 
superior to use” and at Year XVI, Average Adult, “differences between 
abstract words” and the code test. 

Despite the existence of the Abbreviated Scale, it seems to be the 
more usual procedure for psychologists, when the necessity for shorten- 
ing the test period occurs, to give a complete test with the exception 
of the vocabulary test. In such a case, either the right or left side of 
the two lists of fifty words each, making up the vocabulary test, is 
administered and the results doubled in order to obtain a total score. 





1 The authors wish to express their appreciation to Miss Gladys Tallman,
Director of the Psychology Department of the Neurological Institute, for making 
available the data used in this study. 
There are at least two important reasons for such a practice. First,
with adolescents and adults, the vocabulary test usually requires 
more time to administer than any other test on the scale, and, thus, 
in many instances, at least as much time is saved by administering 
only one side of the vocabulary test as would be the case if the Abbrevi- 
ated Scale were used. The latter, of course, includes the vocabulary 
test, which, as Terman points out, “has a far higher value than any
other single test of the scale. Used with children of English-speaking
parents, it probably has a higher value than any three other tests
in the scale. Our statistics show that in a large majority of cases
the vocabulary test alone will give us an intelligence quotient within
ten per cent of that secured by the entire scale” (p. 230).¹ A second
reason for administering only one list of the vocabulary test is the 
implied recommendation of such a procedure in the Record Booklet 
itself. A footnote to the test states, in part, “If only one list is given,
multiply the number of correct definitions by two to get the score.”

Obviously, in order for this method of expediting administrative
time to be practicable, it is essential to determine whether it is a
valid procedure and, should there be any difference in the suitability
of the two lists of words, to determine which list doubled will give
the more reliable result. It is in reference to this aspect of the
methodology of Terman’s revision of the Binet tests that the authors set
for themselves the problem of the present article. Do the test
results of only one word list of the vocabulary test give a measure
which may be considered, for practical purposes, as adequate as a 
measure derived from the administration of both lists? Furthermore, 
which list gives the more reliable result? During her experience in 
administering the Binet test, the first author has often felt that the 
two lists were not equivalent in difficulty and that one list, therefore, 
might be superior in reliability to the other. 


PROCEDURE 




The procedure of this study involved getting a fairly large sample 
of Binet test results and analyzing the subjects’ vocabulary test 
performance in order to learn whether the administration of only one 
list may have adequate empirical justification. 

1. Subjects Used as the Sample.—It is evident that the sample of 
subjects used in such an investigation is of major importance, in so far 





1 Terman, L. M.: The Measurement of Intelligence. Cambridge, Houghton
Mifflin Co., 1916.
as any general recommendations which might be made from the
results are concerned. Although it is impossible to determine accu- 
rately the adequacy of the authors’ sample of six hundred forty-two 
subjects, it is highly probable that it is fairly representative of the 
group of cases which arrives at a psychological clinic from a large 
urban population. Typical of such a situation is the fact that cases 
are drawn from a variety of sources. Specifically, in the authors’
sample, all of the actual testing was done by several examiners, 
including the first author, either at the Neurological Institute or at 
Vanderbilt Clinic of the Columbia Medical Center in New York City. 
The cases referred by the doctors, by the Clinic, by charity organiza- 
tions or by the Board of Education were sent for any one of many 
reasons—emotional maladjustment, school problem, home problem, 
vocational guidance, etc. Those cases referred by Hope Farm were 
sent in order to determine their intelligence level before being accepted 
into or rejected from that community; those referred by the Nursery 
and Child’s Hospital, to determine the patient’s intelligence level 
before placing him in a foster home. The following list briefly
summarizes the source of the six hundred forty-two subjects used
in this study:











Source of the subjects                                    Number  Per cent
[illegible].............................................     174        27
[illegible].............................................     149        23
[illegible].............................................      65        10
Jewish social service organizations.....................      64        10
Board of Education and public schools...................      40         6
[illegible].............................................      38         6
Nursery and Child’s Hospital............................      27         4
Referred for vocational guidance........................      25         4
Non-sectarian social service organizations..............      24         4
Protestant and Catholic social service organizations....      22         3
[illegible].............................................      14         2

Total number of subjects................................     642











Of the total group of subjects two hundred fifty-four were females 
and three hundred eighty-eight males. Their chronological age 
differences are summarized into three age groups, which were arranged 
as being most relevant to the range of difficulty of the vocabulary test: 
Chronological age groups                 Females   Males   Total
Below eight years.......................      45      52      97
Eight to sixteen years..................     135     238     373
Sixteen years and over..................      74      98     172














Their mental age differences, as determined from the administration 
of the Binet tests, are summarized into six groups, which again were 
arranged as being relevant to the vocabulary test: 








Mental age groups                        Females   Males   Total
[Below eight years].....................      56      81     137
[Eight to ten years]....................      60      96     156
[Ten to twelve years]...................      51      72     123
Twelve to fourteen years................      44      54      98
Fourteen to sixteen years...............      25      51      76
Sixteen years and over..................      18      34      52














The intelligence quotients of the subjects are summarized into eight
groups, as follows:

IQ groups                                Females   Males   Total
[illegible].............................      54      55     109
[illegible].............................      18      60      78
80 to 90................................      41      48      89
[illegible].............................      50      78     128
[illegible].............................      49      46      95
[illegible].............................      17      40      57
[illegible].............................      16      35      51
[illegible].............................       9      26      35














It is evident that the subjects do not comprise a group which tends 
to be distributed “normally,” in so far as IQ differences are concerned.
Such a deviation is, of course, to be expected, since this is not a sample 
of the general population’s intelligence quotients but, rather, a 
sample of those individuals from the population who are referred to 
clinics for a mental test.

As regards the administration of the Binet tests to these subjects, 
it is hardly necessary to point out that a complete test was always 
given, all of the examiners having been thoroughly trained and
experienced in the proper administration of the Terman Revision. 

2. Analytical Procedure.—Each subject’s vocabulary performance 
on the right and the left list of words in the Record Booklet was 
recorded with his total vocabulary score. Then each of his scores 
on the two lists was doubled and the results compared with his actual 
(obtained) total vocabulary score, which in this case will be taken 
as the criterion of validity. These comparisons for the group were 
made in reference to the correctness with which each subject’s obtained 
mental-age credit on the total vocabulary test could have been pre- 
dicted by (a) doubling the score made on the right list of words and 
(b) doubling the score made on the left list. The following sum- 
marizes the seven groups into which a subject’s vocabulary per- 
formance may be classified, according to the total number of words 
in the two lists of fifty words each which he correctly defines: 


NUMBER OF CORRECT DEFINITIONS          MENTAL AGE LEVEL
Less than twenty words.............    Below Year VIII
Twenty to thirty words.............    Year VIII
Thirty to forty words..............    Year X
Forty to fifty words...............    Year XII
Fifty to sixty-five words..........    Year XIV
Sixty-five to seventy-five words...    Year XVI (Average Adult)
Seventy-five words or more.........    Year XVIII (Superior Adult)


In predicting a subject’s total vocabulary score by doubling his 
performance on only one of the two lists of words, it is obvious from 
the above seven divisions of score classification that the question for 
investigation is the certainty with which his obtained year level can 
be predicted by doubling the score made on only one-half of the test. 
Prediction is therefore a problem of correct placement into one of 
seven groups which may spread ten units (words) or more, rather 
than into a score position of a given unit value. 
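
The placement problem just described can be sketched in a few lines of Python. This is an illustration only and not part of the original study: the function names are hypothetical, and the lower limits of 20, 30, 40, 50, 65 and 75 words are read from the list above on the assumption that a score exactly at a limit earns the higher year level.

```python
from bisect import bisect_right

# Lower limits (total words correct) of the six credited year levels.
# Treating each limit as inclusive is an assumption made for this sketch.
LOWER_LIMITS = [20, 30, 40, 50, 65, 75]
YEAR_LEVELS = ["Below VIII", "VIII", "X", "XII", "XIV", "XVI", "XVIII"]


def year_level(total_score):
    """Return the year level credited for a total vocabulary score (0-100)."""
    return YEAR_LEVELS[bisect_right(LOWER_LIMITS, total_score)]


def placed_correctly(single_list_score, total_score):
    """True if the doubled single-list score falls in the same year level
    as the score obtained on the complete test."""
    return year_level(2 * single_list_score) == year_level(total_score)


if __name__ == "__main__":
    # Hypothetical subject: 26 words on one list, 22 on the other (total 48).
    print(year_level(48))
    print(placed_correctly(26, 48))
    print(placed_correctly(22, 48))
```

Because placement is a matter of falling into the right band, a doubled score may differ from the obtained total by several words and still count as a correct estimate.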

In order to compare the certainty with which the right and left 
lists give accurate estimates of the total vocabulary score, the authors 
further enumerated for the group the per cent of incorrect estimates 
which fell within ± one unit of the six critical points of the single
list scale. These critical points are scores which, when doubled, 
just place the subject into one of the seven year levels of mental age 
already referred to. They are: 10, 15, 20, 25, 32.5 and 37.5. Thus 
a subject with a total vocabulary score of forty-two words has an 
achievement at the XII year level. If he had twenty-three correct 
definitions on the right list, this score doubled gives an estimated 
total score of forty-six, which correctly places him. However, his
score of nineteen on the left list would, when doubled, be only thirty- 
eight and therefore place him incorrectly. This would be an example 
of an incorrect estimate which lies within one unit of the critical 
point here involved, viz., twenty words, and is, therefore, indicative 
of greater reliability for the list of words than would have been the 
case if the score had deviated farther below the critical point. For 
the total group of subjects, the obtained exceptions to accurate 
estimates were of the kind just noted, i.e., in which the score on only
one of the two lists gave, when doubled, an incorrect estimate. In a
few instances, both of the single-list scores gave an incorrect total 
score estimate. Thus, a subject with single-list scores of seventeen 
and twenty-five for the right and left lists respectively would not have 
been correctly placed into the XII year level (his obtained placement), 
since seventeen doubled is thirty-four (X year level) and twenty-five 
doubled is fifty (XIV year level). 
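
The test of whether an exception lies within one unit of a critical point may be sketched as follows. The code is a hedged illustration, not the authors' own tabulation method: the names are hypothetical, and measuring the distance to the nearest critical point (rather than to the particular point separating the obtained and the estimated year levels) is a simplifying assumption.

```python
# Single-list scores which, when doubled, just reach the lower limit
# of a year level (taken from the text above).
CRITICAL_POINTS = (10, 15, 20, 25, 32.5, 37.5)


def distance_to_nearest_critical_point(single_list_score):
    """Absolute distance from a single-list score to the closest critical point."""
    return min(abs(single_list_score - c) for c in CRITICAL_POINTS)


def within_one_unit(single_list_score):
    """True if the score lies no more than one unit from a critical point.
    Using the nearest point is an assumption of this sketch."""
    return distance_to_nearest_critical_point(single_list_score) <= 1
```

Applied to the examples in the text, a left-list score of nineteen lies one unit below the critical point of twenty, whereas a score of seventeen lies two units from the nearest critical point (fifteen).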


RESULTS 


The results of the authors’ analysis for the total group of subjects 
and for differences in sexes are summarized in Table I. It is notable, 
first, that there is practically no difference between the results of the 
male and female groups. In fact, the percentage of exceptions made 
by the two sex groups is so nearly the same as to suggest that the 
total sample is adequate for the purposes of this analysis. 

In the second place, it is evident that in terms of the actual number 
of exceptions made it does not seem to make any difference in the 
validity of the shortened test procedure whether the right list or the 
left list of words is administered and the results doubled for the total 
vocabulary score. On the right list, 18.8 per cent of the total group 
of subjects were not correctly placed when their scores were doubled 
and compared with their obtained total scores. On the left list, 
the per cent of such exceptions was 18.7. Thus, in practically one 
case in five, a subject’s total vocabulary score would have been in 
error if estimated from the results of only one of the lists of words. 

In the third place, the further analysis reported in Table I reveals 
that the two lists are not comparable in “reliability,” i.e., the left
list is found to be the more reliable in the sense that over seventy-five 
per cent of the obtained exceptions on this list were found to deviate 
no more than one unit below or above their respective critical points 
(viz., 10, 15, 20, 25, 32.5, or 37.5) of the single-list vocabulary scores, 
whereas nearly fifty per cent of the exceptions obtained with the right
list deviated more than one unit. 
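
The percentages quoted in this section can be checked against the counts reported in Table I (642 subjects; 121 and 120 exceptions; 63 and 92 exceptions within one unit of a critical point). The short Python sketch below merely redoes that arithmetic; the helper name `per_cent` is hypothetical.

```python
def per_cent(part, whole):
    """Percentage rounded to one decimal place, as printed in Table I."""
    return round(100 * part / whole, 1)


SUBJECTS = 642
EXCEPTIONS = {"right": 121, "left": 120}        # Table I, item 1
WITHIN_ONE_UNIT = {"right": 63, "left": 92}     # Table I, item 3

for side in ("right", "left"):
    n_exceptions = EXCEPTIONS[side]
    n_near = WITHIN_ONE_UNIT[side]
    n_far = n_exceptions - n_near               # missed by more than one unit
    print(side,
          per_cent(n_exceptions, SUBJECTS),     # exceptions among all subjects
          per_cent(n_near, n_exceptions),       # exceptions within one unit
          per_cent(n_far, SUBJECTS))            # far misses among all subjects
```

The last figure printed for the left list is the share of all subjects whose doubled score missed by more than one unit, which is the quantity the authors round to about five per cent.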


The results for the left list of words thus indicate that only about
five per cent¹ of the total group of six hundred forty-two subjects


TABLE I.—ESTIMATING TOTAL VOCABULARY SCORES FROM EITHER HALF OF THE
VOCABULARY TEST—RESULTS FOR DIFFERENT SEXES AND TOTAL GROUP

                                                  Females    Males      All
Number of subjects..............................      254      388      642
1. Number of exceptions resulting from
   (a) Doubling scores on right list............       49       72      121
   (b) Doubling scores on left list.............       46       74      120
2. Per cent of exceptions resulting from
   (a) Doubling scores on right list............     19.3     18.6     18.8
   (b) Doubling scores on left list.............     18.1     19.1     18.7
3. Number of obtained exceptions falling
   within ±1 of critical points¹
   Right list...................................       25       38       63
   Left list....................................       36       56       92
4. Per cent of obtained exceptions falling
   within ±1 of critical points²
   Right list...................................     51.0     52.8     52.1
   Left list....................................     78.3     75.7     76.7














1 As explained in the text under Procedure, these critical points of the single- 
list scale are scores which, when doubled, just place the subject within the lower 
limit of his correct total vocabulary age level. 

2 As implied, these percentages are the ratios of the number of obtained excep- 
tions falling within +1 of the critical points to the total number of exceptions. 


obtained scores which deviated more than one unit from their respec- 
tive critical points on the single-list scale. The authors therefore 
thought it might be possible, in considering the value of the shortened 
vocabulary test, to increase the validity of the procedure by using 
the left list and adding one unit to any scores falling not more than 
one point below their respective critical points. Thus a subject 
with a score of 9, 14, 19, 24, 31.5 or 36.5 on the left list might be given
an extra word credit and his new score doubled in order to obtain his 
total vocabulary score. In making such an adjustment to the border- 





1 Since 18.7 per cent of the total group of subjects had scores on the left list 
which, when doubled, gave incorrect total vocabulary scores, and since 76.7 per 
cent of these exceptions deviated within one unit of their respective critical points, 
therefore less than one-quarter of the 18.7 per cent, or about five per cent, had 
deviations greater than one unit. 
line scores, however, the authors found that as many new exceptions
for the total group of subjects were introduced as were eliminated. 
Similar results were obtained when one point was also deducted for all 
scores on the left list which lay not more than one point above their 


TABLE II.—ESTIMATING TOTAL VOCABULARY SCORES FROM EITHER HALF OF THE
VOCABULARY TEST—RESULTS SUMMARIZED FOR CHRONOLOGICAL-AGE DIFFERENCES

                                           Below eight years     Eight to sixteen years  Sixteen years and over
                                         Females   Males     All Females   Males     All Females   Males     All
Number of subjects......................      45      52      97     135     238     373      74      98     172
1. Number of exceptions from
   (a) Doubling right list..............       1       1       2      25      47      72      23      24      47
   (b) Doubling left list...............       1       3       4      30      50      80      15      21      36
2. Per cent of exceptions from
   (a) Doubling right list..............       2       2       2      19      20      19      31      25      27
   (b) Doubling left list...............       2       6       4      22      21      21      20      21      21
3. Number of exceptions within ±1 of critical points
   Right list...........................       0       1       1      13      27      40      12      10      22
   Left list............................       1       2       3      25      38      63      10      16      26
4. Per cent of exceptions within ±1 of critical points
   Right list...........................       0     100      50      52      57      56      52      42      47
   Left list............................     100      67      75      83      76      79      67      76      72
































respective critical points. Inasmuch as this failed to give any increase 
in the validity of the shortened vocabulary test, the authors made a 
further analysis of their data in order to learn whether the hetero- 
geneity of their sample, as regards (1) chronological age, (2) actual 
vocabulary ability, (3) mental age, and (4) intelligence quotient, 
might not be operating to reduce the validity of the single-list pro- 
cedure, as obtained in Table I for the total group of subjects. 
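
The borderline adjustment that the authors tried, and found to introduce as many new exceptions as it removed, can be written out as a brief Python sketch. It covers only the upward credit for scores lying not more than one point below a critical point; the function name is hypothetical and the code is not taken from the original study.

```python
CRITICAL_POINTS = (10, 15, 20, 25, 32.5, 37.5)


def adjusted_total(left_list_score):
    """Doubled left-list score after adding one word of credit to any score
    lying not more than one point below a critical point."""
    for point in CRITICAL_POINTS:
        if 0 < point - left_list_score <= 1:
            return 2 * (left_list_score + 1)
    return 2 * left_list_score
```

For example, a left-list score of nineteen would be raised to twenty and doubled to forty, while a score of eighteen would simply be doubled to thirty-six.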
Chronological Age Differences.—In administering the Binet test, 
the examiner knows in advance the chronological age of his subject. 
Consequently, if there are differences in the validity and reliability, 
as herein defined, of the shortened, single-list vocabulary test for 
subjects of varying ages, this knowledge could be applied in the 
administration of the test. With this possibility in view, the authors 
classified the subjects into three chronological-age groups: (1) Those
below eight years (since, on the average, children below eight years 
are not expected to define twenty of the words, the minimum for any 
credit on the vocabulary test), (2) those from eight to sixteen years 
(being the range up to the Average Adult Level), and (3) those of 
sixteen years and over. The results of Table II reveal important 
differences in the validity and reliability of the shortened, single-list 
vocabulary test when these differences in the subjects’ ages are 
recognized. 

First, as to the validity of the two lists: It is evident that the use 
of only the single list is a highly valid procedure for the subjects 
below eight years of age, since practically no exceptions from doubling 
the scores were found—two per cent for the right list and four per cent 
for the left one. For the group of eight to sixteen years, the validity is 
practically the same as was found for the total group of subjects, there 
being about twenty per cent incorrect vocabulary score placements 
when the single-list scores are doubled. For the group of sixteen 
years and over, the right list becomes less valid, the exceptions increas- 
ing to twenty-seven per cent; whereas the left list continues to have 
about twenty per cent exceptions. Thus, so far as accuracy of total 
vocabulary score estimates from single-list scores are concerned, there 
is a tendency for the right list to be preferable for children below 
eight years of age, for either list to be suitable for subjects eight to 
sixteen years, and for the left list to be definitely superior for subjects 
sixteen years and over. 

Second, as to the reliability of the two lists: As estimated in terms 
of the extent to which exceptions deviate less than ± one unit from
their respective critical points on the single-list scale, there are so few 
exceptions for the group below eight years of age as to make this 
criterion practically insignificant in this case. For the two older 
age groups, however, the left list is definitely superior to the right 
list, since in both cases seventy-two to seventy-nine per cent of the 
exceptions of the left list fall within the range of one unit, whereas 
only forty-seven to fifty-six per cent of the exceptions of the right 
list fall within this range. Thus, by this criterion, the left list is 
definitely superior for the subjects eight years of age and over. 

Total Vocabulary Score Differences——The results of Table III 
reveal that there are important differences in the validity and relia- 
bility of the shortened, single-list vocabulary test when the subjects’ 
differences in achieved vocabulary ability on the total test are taken 
into consideration. As indicated in the Table, the subjects were
classified into three vocabulary performance groups: Those defining 
correctly less than twenty words and, consequently, receiving no 
vocabulary credit; those defining correctly from twenty to forty 
words and, consequently, being given credit at the VIII or X year 
levels; and those defining correctly forty words or more and, conse- 
quently, being given credit at the XII year or higher levels. 
Noteworthy is the fact that there are practically no incorrect 
placements (only three in two hundred twenty-nine cases) with the 
right-list scores doubled for subjects receiving a total score of less 
than twenty words, whereas fourteen per cent are incorrectly placed 
with the left list of words. The superiority of the right list over the 


TABLE III.—ESTIMATING TOTAL VOCABULARY SCORES FROM EITHER HALF OF THE
VOCABULARY TEST—RESULTS SUMMARIZED BY TOTAL VOCABULARY SCORE DIFFERENCES

                                                      Total vocabulary score groups
                                              Below twenty           Twenty to forty         Forty or more
                                              correct words           correct words          correct words
                                         Females   Males     All Females   Males     All Females   Males     All
Number of subjects......................      92     137     229      75     116     191      87     135     222
1. Number of exceptions from
   (a) Doubling right list..............       0       3       3      23      32      55      26      37      63
   (b) Doubling left list...............      11      20      31      20      34      54      15      20      35
2. Per cent of exceptions from
   (a) Doubling right list..............       0       2       1      31      28      29      30      27      28
   (b) Doubling left list...............      12      15      14      27      29      28      17      15      16
3. Number of exceptions within ±1 of critical points
   Right list...........................      ..       2       2      13      17      30      12      19      31
   Left list............................       6      17      23      17      25      42      13      14      27
4. Per cent of exceptions within ±1 of critical points
   Right list...........................      ..      67      67      57      53      55      46      51      49
   Left list............................      58      85      74      85      74      78      87      70      77
































left list for this class of subjects cannot in practice, of course, be used 
to full advantage when only the shortened form of the vocabulary 
test is being administered. But taken in conjunction with the results 
of Table II, it is evident that the Binet administrator will obtain 
more valid results with the right list than with the left list when
testing children of less than eight years of age. 

As for the other two sub-groups of Table III, it is evident that 
there is practically no difference in the percentage of incorrect total 
score placements, using either the right or left list, for subjects achiev- 
ing twenty to forty words, and that, furthermore, the per cent of 
exceptions is so high (nearly thirty per cent) as to make very ques- 
tionable the use of the shortened vocabulary test for subjects of this 
level of vocabulary ability. Practically one in three subjects is 
incorrectly placed when the single-list score is doubled. On the 
other hand, for those subjects defining correctly forty words or more, 
the left list of words is definitely superior in terms of correct predic- 
tions to the right list, there being sixteen per cent exceptions with 
the former and twenty-eight per cent with the latter. The left list 
of words is also seen to be superior to the right list for this group 
inasmuch as seventy-seven per cent of the obtained exceptions lie 
within ± one unit of their respective critical points on the single-list
scale, whereas more than fifty per cent of the exceptions on the right 
list lie beyond this range. Practically the same difference is also 
characteristic of the exceptions for those subjects in the twenty-to- 
forty-word group. As for those subjects defining less than twenty 
words, there is less difference for the two lists, but this criterion is not 
particularly important here since, as already pointed out, the actual 
exceptions on the right list were only three in two hundred twenty-nine, 
compared to thirty-one in two hundred twenty-nine for the left list. 

Mental Age and Intelligence Quotient Differences.—In order to 
ascertain whether their results might be peculiar to a large group 
heterogeneous in mental age and intelligence quotient, the authors 
analyzed their data by classifying their subjects into six mental 
age groups and eight IQ groups (see description of the sample under 
Procedure). When the vocabulary performance of the subjects in 
these various sub-groups was analyzed, the results were found to be 
consistent with the findings as already presented. Consequently, 
the detailed data are not summarized here in tabular form. 


SUMMARY INTERPRETATION 


Assuming the present sample of six hundred forty-two subjects 
to be adequately representative of members of both sexes who are 
referred from large urban populations to clinics for psychological 
tests, and assuming that occasions arise which make it expedient to 
use the shortened, single-list vocabulary test of the Terman Revision 
of the Binet scale, the authors believe their foregoing analysis warrants
the following general recommendations: 

1. If a subject to be tested is less than eight years of age, administer
the right list of words; 

(a) If his score on this list is less than ten correct definitions, it 
is highly probable that the score doubled will be an adequate estimate 
of his placement on the total vocabulary test; 

(b) If his score is ten correct definitions or more, it is advisable to 
administer the second list of words, i.e., to employ the whole vocabulary
test. 

2. If a subject is eight years of age or older, administer the left list 
of words; 

(a) If his score on this list is less than ten correct definitions, it 
is highly probable that his score doubled will be an adequate estimate 
of his placement on the total vocabulary test; 

(b) If his score is at least ten but not more than twenty correct 
definitions, the chances are only about two in three that his total 
vocabulary placement will be correct, when obtained by doubling his 
score; consequently, in such cases it is advisable to administer the 
second list of words, i.e., to employ the whole vocabulary test;

(c) If his score is twenty or more correct definitions and if it is 
of a value that is more than one unit from the nearest critical point 
of the single-list scale (i.e., if the score is twenty-two to twenty-three,
twenty-seven to thirty-one, thirty-four to thirty-six, or thirty-nine 
and over), there is little likelihood that an error will be made in taking 
his doubled score as the index of his placement on the total vocabulary 
test. If, on the other hand, his score is within one unit of one of the 
critical points (i.e., within one unit of 20, 25, 32.5, or 37.5), it is then
advisable to administer the second list of words, i.e., to employ the
whole vocabulary test. 
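
The recommendations above amount to a small decision procedure, which may be sketched in Python as follows. This is an illustrative reading of the two numbered rules and not an official scoring routine: the function and constant names are hypothetical, and half-point scores are assumed to be possible because the critical points 32.5 and 37.5 are not whole numbers.

```python
# Critical points that apply to scores of twenty or more (recommendation 2c).
UPPER_CRITICAL_POINTS = (20, 25, 32.5, 37.5)

DOUBLE = "double the score"
SECOND_LIST = "administer the second list"


def list_to_administer(age_years):
    """Right list below eight years of age, left list otherwise."""
    return "right" if age_years < 8 else "left"


def next_step(age_years, score):
    """Decide whether the single-list score may be doubled or whether
    the whole vocabulary test should be given."""
    if score < 10:                        # recommendations 1(a) and 2(a)
        return DOUBLE
    if age_years < 8:                     # recommendation 1(b)
        return SECOND_LIST
    if score <= 20:                       # recommendation 2(b)
        return SECOND_LIST
    near_critical = any(abs(score - c) <= 1 for c in UPPER_CRITICAL_POINTS)
    return SECOND_LIST if near_critical else DOUBLE   # recommendation 2(c)
```

Under this reading a twelve-year-old with twenty-three correct definitions on the left list would have his score doubled, whereas one with twenty-four would be given the second list as well.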

These recommendations are made regardless of the sex of the
subject, since it was found that such a factor of difference did not
affect the trend of the results.

Finally, as was emphasized at the beginning of this article, it is
always preferable, when possible, to administer a complete Binet test. 
The foregoing recommendations are made only in reference to those 
instances in which there arises a real necessity for shortening the test 
period and the examiner therefore decides to meet such an exigency 
by using the shortened, single-list vocabulary test rather than, or in 
conjunction with, the Abbreviated Scale of the Terman Revision. 







A CRITICAL NOTE ON THE USE OF THE TERM 
“RELIABILITY” IN MENTAL MEASUREMENT 


FLORENCE L. GOODENOUGH 


Institute of Child Welfare, University of Minnesota 


In the September, 1935, number of this Journal, Jordan calls 
attention to certain systematic discrepancies in the “reliability 
coefficients” obtained by (a) correlating one form of a test with
another, presumably equivalent, form and (b) correlating the odd items 
of one form with the even items of the same form and applying the 
Spearman-Brown prophecy formula to the result. 
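
For readers who wish to see the “stepping up” of method (b) worked out, the Spearman-Brown prophecy formula in its standard general form estimates the reliability of a test lengthened k times as k·r / (1 + (k − 1)·r), where r is the correlation obtained for the original length. The Python sketch below is offered only as an illustration; the function name is hypothetical and the figures used are not taken from Jordan's data.

```python
def spearman_brown(r, k=2.0):
    """Predicted reliability of a test lengthened k times, given the
    correlation r observed at the original length. With k = 2 this is the
    usual correction applied to an odd-even (split-half) correlation."""
    return k * r / (1 + (k - 1) * r)


if __name__ == "__main__":
    # Illustrative value only: a split-half correlation of 0.60.
    print(spearman_brown(0.60))
```

The stepped-up coefficient is always at least as large as the half-test correlation for positive r, which should be borne in mind when it is set against a form-versus-form coefficient.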

Jordan reports that in every one of fourteen comparisons made, 
the coefficient obtained by “stepping up” the correlation between
the odd and even items of a given form exceeded in magnitude those 
obtained by correlating one form with another, and that in eight of 
the fourteen comparisons the differences were more than three times 
their standard errors. 

This finding is in complete accordance with the results of several 
studies carried on with young children at the University of Minnesota 
Institute of Child Welfare. For the most part, our data involve a 
comparison between the “split scale” method and the “test-retest”
method, since for many of the tests used no comparable second form is 
available. It has been generally presumed that in the latter case 
memory transference leading to correlation between errors is likely 
to result in a spuriously high “reliability coefficient” unless the
interval between testings is so long that memory factors may fairly
be expected to have become inoperative.¹ This may be true when the
subjects are adults or older children but I have seen no empirical 
data to substantiate the hypothesis. It has further been assumed 
that psychological disparity of items, which is almost certain to creep 
in when dissimilar forms of a test are used, will tend to lower the 
obtained correlations below the level that they would take if each 
form truly measured exactly the same aspects of ability or the same 
fields of specialized knowledge. The same condition obviously holds 
when the odd-item versus the even-item method of correlation is used, 
since it is highly unlikely that the matched items will in all cases be 
absolutely similar in a psychological sense. 





1 Cf. Kelley, T. L.: Statistical Method. Pp. 201-203. 
If this theory holds, with no counteracting influences to obscure
the results, we should expect to find (a) that the correlation between 
an initial test and a retest on the same material after a brief time- 
interval would exceed that obtained by the split-scale method after 
applying the prophecy formula and (b) that no reliable difference 
would be obtained between the results of the form-versus-form procedure
and that obtained by “stepping up” the correlations between the 
sums of the alternate items, provided, of course, that the two half- 
scales thus produced are psychologically as similar to each other as 
are the two complete forms of the test. 
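
The “prophecy formula” used to “step up” a half-scale correlation is the Spearman-Brown formula, which for a test doubled in length gives r_full = 2r / (1 + r). The following is a minimal sketch in Python; the function name and the half-scale correlation of 0.70 are illustrative assumptions and are not figures from the study.

```python
def spearman_brown(r_part: float, length_factor: float = 2.0) -> float:
    """Estimate the correlation for a test `length_factor` times as long
    as the part-scales whose intercorrelation is `r_part`.

    With length_factor=2 this is the usual split-half correction.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    denominator = 1.0 + (length_factor - 1.0) * r_part
    if denominator <= 0:
        raise ValueError("formula is undefined for this r_part and length_factor")
    return length_factor * r_part / denominator


if __name__ == "__main__":
    r_halves = 0.70  # hypothetical correlation between sums of alternate items
    print(f"stepped-up correlation: {spearman_brown(r_halves):.3f}")
```

Because the correction only rescales the half-scale correlation, anything that raises the agreement between the two halves raises the stepped-up value as well.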

Empirical evidence, however, fails to substantiate either of these 
assumptions. In a study of the reliability of the Kuhlmann-Binet 
tests for preschool children that was carried out at the University 
of Minnesota Institute of Child Welfare some years ago¹ it was found
that for each of six groups of children, the correlation between the
intelligence quotients earned on an initial test and a retest given after
an average interval of six weeks was lower than that obtained by
applying the Spearman-Brown formula to the correlation between the 
sums of alternate items; and this in spite of the fact that the half-scales 
thus formed were made up of very dissimilar items. While it is of 
course possible that irregular rates of individual mental growth during 
the six weeks’ interval between testings may have been partially 
responsible for lowering the correlation between initial test and retest, 
it seems unlikely that this is the major factor involved. Rather, I am 
inclined to favor the explanation suggested by Jordan, i.e., that
variability in the subjects tested² with respect to such matters as
motivation, fatigue, interest and effort and other non-intellectual
factors is chiefly responsible.

Our study, therefore, agrees with that by Jordan in finding the 
correlation between the sums of the alternate items in a test to be 
consistently higher than that obtained by using another accepted 





1 Goodenough, Florence L.: The Kuhlmann-Binet Tests for Children of Pre- 
school Age; a Critical Study and Evaluation. Minneapolis: University of Minne- 
sota Press, 1928. 

2 It should be noted in this connection that in our study all cases in which the 
child did not seem to be giving adequate cooperation were thrown out. This
included all cases showing marked negativism, shyness or other evidences of 
emotional disturbance. The results cited above suggest that even the judgment of 
trained examiners accustomed to dealing with little children is insufficient to 
ascertain when such factors are present in sufficient degree to affect the results of a 
test when the overt behavior of the child is the only available criterion. 
method of measuring “reliability.” Jordan compared the alternate-item
method with the equivalent-form method; we compared the
alternate-item method with the test-retest method. Jordan used as
subjects college freshmen and high-school seniors; we used children of
preschool age. Jordan’s tests were all given at a single sitting; our
retests were given after an interval of approximately six weeks.¹ The
fact that both studies show consistently higher values for the
correlations obtained by the method employing the sums of alternate items
should be sufficient evidence that the practice of applying the single
term “reliability” to results of such varied meaning is scientifically
indefensible.

There remains the question, Which of the three methods is to be 
preferred? Here I find myself unable to accept Jordan’s conclusion 
that “the coefficient derived from correlating even items with odd
items is probably a truer measure of the reliability of a test than that 
obtained from correlating duplicate forms because the even-odd 
method eliminates pupil variability.” Jordan appears to draw a sharp 
distinction between what he calls the “variability of the student” and 
the “reliability of the test.” Actually, of course, the two are
inseparable. It is the pupil’s performance and not the printed test-form that
constitutes the test and if pupil-variability exists to an extent that 
affects the test-results, we do not get rid of it by dividing it equally 
between the two variables that are to be correlated. If it is fair to 
think of the variability introduced by non-intellectual factors, such as 
those that have been cited, as “errors” comparable in significance with
the “memory transference,” which Kelley suggests may spuriously
raise the reliability coefficients obtained by the test-retest method when 
the interval between testings is short, then it is evident that dividing 
these errors more or less equally between the two sets of items to be 
correlated, as is done when the alternate-item technique is employed, 
will involve a similar correlation between errors and a correspondingly 
spurious increase in the “reliability coefficient” thus obtained. Pupil- 
variability is not eliminated by this method, nor is it wholly covered 
up, but although it may perhaps be looked upon as a source of error²





1 Since intelligence quotients rather than mental ages were used in computing 
the correlations between tests, the fact that the length of the interval between 
testings varied slightly from one child to another (though the differences were in all 
cases small) should not seriously affect the results. 

2 I am not at all sure that it is fair to regard pupil-variability, in the sense that
Jordan uses the term, as a true source of error in measurement. If such variability 
as far as its effect upon our measurement of the thing we set out to
measure is concerned, by the use of the alternate-item method such 
variability actually increases the apparent “reliability” of the 
measurement. 
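
The argument above can be illustrated with a small simulation. This is a sketch under assumed conditions rather than a reanalysis of any data in the study: each simulated pupil has a fixed ability plus a temporary “state” (mood, effort, fatigue) that is shared by both halves of a single sitting but drawn afresh at the retest. All variances, the sample size and the function names are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n_pupils=5000, state_sd=0.7, noise_sd=0.7, seed=0):
    """Return (stepped-up alternate-item r, test-retest r) for simulated pupils."""
    rng = random.Random(seed)
    odd, even, first_total, second_total = [], [], [], []
    for _ in range(n_pupils):
        ability = rng.gauss(0, 1)
        state_1 = rng.gauss(0, state_sd)  # temporary state at the first sitting
        state_2 = rng.gauss(0, state_sd)  # independent state at the retest
        # Both half-scales of the first sitting share the same temporary state.
        odd_half = ability + state_1 + rng.gauss(0, noise_sd)
        even_half = ability + state_1 + rng.gauss(0, noise_sd)
        # The retest total (two halves) reflects the second day's state.
        retest = 2 * (ability + state_2) + rng.gauss(0, noise_sd) + rng.gauss(0, noise_sd)
        odd.append(odd_half)
        even.append(even_half)
        first_total.append(odd_half + even_half)
        second_total.append(retest)
    r_halves = pearson(odd, even)
    stepped_up = 2 * r_halves / (1 + r_halves)  # Spearman-Brown correction
    return stepped_up, pearson(first_total, second_total)


if __name__ == "__main__":
    stepped_up, retest = simulate()
    print(f"alternate-item (stepped up): {stepped_up:.2f}")
    print(f"test-retest:                 {retest:.2f}")
```

Under these assumptions the shared state counts as agreement between the halves, so the stepped-up coefficient comes out well above the test-retest correlation; setting `state_sd` to zero removes the gap.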

The fact is that no individual, whether he be child or adult, always 
maintains his own maximum level of efficiency. Always, a certain 
amount of variation above and below his own average level of per- 
formance can be observed. There is evidence that a given direction 
of variation (above or below his own mean performance) is not wholly 
an evanescent phenomenon but tends to persist for a longer or shorter 
period of time. The actual duration of a given period of better-than- 
average or poorer-than-average performance as well as the extent of its 
deviation from the mean is presumably affected both by intrinsic and 
extrinsic factors, but in no case will it be maintained indefinitely. 
Thus, in any finite sample of behavior, such as the performance on 
some standardized test, some children will do better than is their wont, 
others will do more poorly, and still others will shift their level of 
performance from one part of the test to another. 

Now if, by an artificial shifting of the time-factors, we iron out 
these variations which are very real characteristics of pupil perform- 
ance, we shall give an appearance of stability to our method which 
does not in fact exist. Human beings do vary in performance from 
time to time. It is highly unlikely that the sampling of behavior 
obtained within the time-limits of an ordinary test is sufficiently large 
to yield a true measure of the average performance of each individual 
tested, and it is also unlikely that all will maintain a constant level of 
efficiency from beginning to end of the test. And since the reliability 
of a test is not a thing apart from the pupil but is definitely a function 
of pupil-performance, any method that tends to cancel out the pupil- 
variability that is an integral part of such performance will give an 
erroneous picture of the facts. 

Which is the best method? I do not think the question can be 
answered by any rule-of-thumb statement. It depends on the 
particular conditions of testing, the age and other characteristics of 





exists, then it is not an error but a fact with which we have to deal. If certain 
pupils, because of fatigue, boredom, or other factors of a similar nature, perform 
less well during the administration of a second form of an intelligence test than 
they did during the first form while others become gradually warmed up to the 
task and actually improve their performance, it is but fair to ask whether or not 
corresponding changes take place in the broader situations of everyday life. 
the subjects and, most of all, upon what the investigator wants to
find out. What we should do, I think, is to relegate the use of the
term “reliability” to the limbo of outworn concepts and express our
results in terms of the actual procedure used. It is quite as easy to
speak of the “correlation between test and retest” after a stated
interval or between “the sums of alternate items” or between
“equivalent forms” of a test as it is to use the more conventional but far less
accurate expression, “reliability.” Moreover, it is my opinion that
the term itself has led to much confusion of thought among practical 
educators who try to keep up with the research literature but are not 
themselves actively engaged in research. Even among the initiate, 
who, if questioned, will glibly repeat the Otis definition to the effect 
that “the reliability of a test indicates the degree to which it is
consistent in measuring whatever it does measure,” there is, I suspect, a
lingering feeling that a statement about the reliability of a test of 
intelligence is equivalent to a statement about its dependability as a 
test of intelligence. By giving up the name while retaining the proc- 
esses we shall gain in precision of thought and expression with no loss 
of informative data. 

Since the three methods are no longer regarded as equivalent we 
shall not attempt to decide, once and for all, which is best, but in each 
case we shall select that one which seems best suited to yield the 
information of which we are in search. A comparison of one method 
with another may provide further significant factors. If, for example, 
we find that in a particular instance the correlation between the sums 
of alternate items greatly exceeds that between apparently equivalent 
forms given, let us say, on successive days, it is evident that what 
Jordan calls “pupil variability” has been present in more than the
usual degree. It may be that the nature of the test or the conditions 
under which it was given maximize the influence of temporary fluctua- 
tions in mood, interest, effort and so on. This might conceivably 
be the case with certain of the personality inventories or attitude scales 
in common use. A comparison of the correlations obtained by the 
two methods would thus yield information that could not possibly 
be gained by either taken separately. For the relatively high correla- 
tion, when the alternate-item method is used, tells us nothing of the 
stability or instability of individual scores from day to day. As was 
pointed out before, such a correlation may have been spuriously 
increased by correlation between factors unrelated to the purpose of 
the test and changing from moment to moment. On the other hand, 
the comparatively low correlation between successive forms might
be due either to technical inadequacies, such as lack of clarity in the 
questions asked, to variations in the conditions of testing or to the 
pupil-variation described by Jordan. A comparison of results obtained 
by the two methods enables us to place the difficulty much more 
precisely, for if the correlation between the sums of alternate items is 
high, it is a fair guess that factors other than technical inadequacies
are responsible.











THE EXAMINEE DEFINES “MELLOW” 


HENRY FEINBERG 
Jewish Social Service Bureau, Detroit 


“Mellow” appears as the eighth word in one of the two fifty-word 
vocabulary lists of the Stanford Revision of the Binet-Simon Tests. 
This “vocabulary test was derived by selecting the last word of every
sixth column in a dictionary containing approximately eighteen
thousand words presumably the eighteen thousand most common
words in the language,” and the words in this test, according to
Professor Lewis M. Terman, “are arranged approximately in order of
their difficulty.”* Professor Terman found a high correlation existing
between the vocabulary test results and the mental ages of the persons
tested, and maintained the belief that “it will be possible before long
to measure the intelligence level almost as accurately by means of a
vocabulary list of one hundred crucial words as it can now be measured
by any existent intelligence scale.”®

The average eight-year-old child is expected to define ten words in 
either of the two lists, or twenty words in both. If “mellow” is 
arranged approximately in the order of difficulty in the list, it is 
reasonable to assume that the average child of eight years of age, 
having normal intelligence, should be able to define the word correctly. 
Margaret V. Cobb¹ reports that, of the one hundred two times the word
was asked, only two children in the kindergarten, one in the 1-A grade
and one in the 1-B, correctly defined “mellow.” It was her conclusion that
the word should be placed fifteenth in the list, which would be among 
the words passed by the average ten-year-old child. 

The purpose of this study is to reconsider the proper allocation of
the word “mellow” and to determine any influences which may
militate against “mellow” being used in a vocabulary list which
has the diagnosis of intelligence as its aim.

This report is the result of a study made on fourteen hundred 
thirty-one children and adults examined at the Mental Hygiene Clinic 
of the Jewish Social Service Bureau of Detroit. The individuals 
came from homes of various creeds (Catholics, Protestants and Jews) 
and were referred to the clinic for a variety of reasons. They came 
from homes in which English was spoken, and all were born in the 
United States and have lived here all their lives. Those who were 
six years and older have attended public schools. 
Included in this study were eight hundred thirty-two females and
five hundred ninety-nine males. They were further divided into
diagnostic categories, resulting in the following groups: two hundred
nine persons of superior intelligence, of whom ninety-seven were
females and one hundred twelve, males; six hundred twenty persons
of normal intelligence, of whom three hundred forty-six were females
and two hundred seventy-four, males; two hundred forty-four
persons of dull normal intelligence, of whom one hundred forty-one were
females and one hundred three, males; one hundred seventy-seven
persons of borderline intelligence, of whom one hundred eleven were
females and sixty-six, males; and one hundred eighty-one persons who
were feeble-minded, of whom one hundred thirty-seven were females
and forty-four, males.

Table I. Distribution Curve of I.Q.’s in this Study Compared with a
Normal Distribution.

Table I illustrates the curve of intelligence distribution, as com- 
pared with the normal distribution curve. Theoretically twenty- 
five per cent of persons examined should be above normal in intel- 
ligence, fifty per cent at normal and twenty-five per cent below normal.® 
Our distribution is skewed so that of the females twelve per cent were 
above normal, forty-two per cent at normal, and forty-six per cent 
below normal; of the males, nineteen per cent were above normal, 
forty-six per cent at normal and thirty-five per cent below normal; 
and of the total, fourteen per cent were above normal, forty-four per 
cent at normal and forty-two per cent below normal. 
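
The group totals and band percentages can be recomputed from the counts given in the text. The sketch below is illustrative only: it assumes that “above normal” means the superior group and that “below normal” combines the dull normal, borderline and feeble-minded groups, and the source does not state how its percentages were rounded, so recomputed values may differ by a point or so from the rounded figures quoted above.

```python
# Counts by diagnostic category and sex, as stated in the text.
COUNTS = {
    "superior": {"female": 97, "male": 112},
    "normal": {"female": 346, "male": 274},
    "dull normal": {"female": 141, "male": 103},
    "borderline": {"female": 111, "male": 66},
    "feeble-minded": {"female": 137, "male": 44},
}

# Assumed grouping of categories into the three bands discussed in the text.
BANDS = {
    "above normal": ["superior"],
    "normal": ["normal"],
    "below normal": ["dull normal", "borderline", "feeble-minded"],
}


def total(sex=None):
    """Number of persons of the given sex (or of both sexes if sex is None)."""
    return sum(
        n
        for by_sex in COUNTS.values()
        for s, n in by_sex.items()
        if sex in (None, s)
    )


def band_percentages(sex=None):
    """Percentage of the chosen group falling in each band."""
    denominator = total(sex)
    result = {}
    for band, categories in BANDS.items():
        numerator = sum(
            n
            for category in categories
            for s, n in COUNTS[category].items()
            if sex in (None, s)
        )
        result[band] = 100.0 * numerator / denominator
    return result


if __name__ == "__main__":
    for label, sex in (("females", "female"), ("males", "male"), ("total", None)):
        rounded = {band: round(p, 1) for band, p in band_percentages(sex).items()}
        print(label, total(sex), rounded)
```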
Each of these was further subdivided into two groups, the first
according to the age level at which the vocabulary test was passed 
(VA), the second, according to the chronological age (CA). 

Table II shows the number and percentage of females defining
“mellow” according to the age-level at which they passed the
vocabulary test (VA). Table III similarly describes the males. In the
aggregate forty-eight per cent of the females and thirty-six per cent of
the males pass the test. The total picture of the situation is seen in
Table IV, which shows the number and percentage of persons of both
sexes defining “mellow” according to the age-level at which they
passed the vocabulary test. Those who had a vocabulary-age lower 
than the average eight-year-old child number two hundred fifty-three, 
of whom only two passed the test, one having borderline intelligence 
and the other being feeble-minded. Those who had a vocabulary-age 
of eight years defined the word correctly in the following proportions 
in the various diagnostic categories: Seventeen per cent of those having 
superior intelligence, four per cent of those having normal intelligence, 
fifteen per cent of those having dull normal intelligence, eleven per cent 
of those having borderline intelligence, and four per cent of the 
feeble-minded. Of the two hundred twenty-four persons at this
vocabulary-age level only eight per cent defined the word correctly. 
In the group having a vocabulary age of ten, the proportions of 
correct definitions were as follows: Thirty-one per cent of those having 
superior intelligence, twenty-two per cent of those having normal 
intelligence, twenty-nine per cent of those having dull normal
intelligence, thirty-six per cent of those having borderline intelligence and
thirty-seven per cent of the feeble-minded. Of the total in this 
group thirty per cent gave correct responses. 

Of those having a vocabulary age of twelve, fifty-seven per cent of 
those having superior intelligence, fifty-three per cent of those having 
normal intelligence, sixty-five per cent of those having dull normal 
intelligence, seventy per cent of those having borderline intelligence, 
and eighty-six per cent of the feeble-minded defined “mellow”
correctly. Sixty-one per cent of those having a vocabulary-age of
twelve defined “mellow” according to its current usage. Of those
having a vocabulary age of fourteen, eighty-one per cent of those 
having superior intelligence, eighty-two per cent of those having 
normal intelligence, ninety per cent of those having dull normal 
intelligence and ninety per cent of the feeble-minded defined “mellow”
correctly. In this group were two hundred sixty-six persons, of whom 
eighty-four per cent gave correct definitions. Of those having a
vocabulary age of sixteen, ninety-one per cent of those having superior
intelligence, eighty-nine per cent of those having normal intelligence
and one hundred per cent of the dull normal defined the word correctly.
Ninety per cent of the persons at this vocabulary-age level defined
“mellow” correctly. Of those having a vocabulary age of eighteen,
ninety-four per cent of those having superior intelligence and one
hundred per cent of those having normal and dull normal intelligence
defined “mellow” correctly. In all, six hundred seventeen, or
forty-three per cent, of the fourteen hundred thirty-one persons in this
study defined “mellow” correctly.


CLASSIFICATION BY CHRONOLOGICAL AGE


A second way of determining the proper allocation of the word
“mellow” was to analyse how these fourteen hundred thirty-one
persons defined the word according to their chronological ages.
In making this analysis we bore in mind two criteria for determining 
the validity for a test, the first used by Binet,‘ and the second by 
Terman.‘ Binet used percentages and believed that if sixty-six to 
seventy-five per cent of any age group passed a given test that test 
was valid. Terman discarded the percentage method and stated 
that “the guiding principle was to secure an arrangement of the tests
and a standard of scoring which would cause the median mental age 
of the unselected children of each age group to coincide with the 
median chronological age.” 

Table V shows the number and percentage of females defining 
“mellow” correctly, according to the chronological age and diagnostic 
category. Of the one hundred sixty-three children between the ages 
of eight and fourteen, inclusive, who were of superior and normal 
intelligence, only forty-six, or twenty-eight per cent, defined the term
correctly. In this group twenty-three of the one hundred twenty, or
nineteen per cent, of those having normal intelligence defined “mellow”
acceptably. It is only at age fifteen that more than fifty per cent of
the children with normal intelligence passed the test, the proportion
which, according to accepted standards, governs the validity of a test;
this means that the word “mellow,” as far as females are concerned,
should be placed in the list at the fifteen-year age level.

Table VI illustrates a closely similar situation among the
males. Of the two hundred twenty-three superior and normal boys
aged eight to fourteen, inclusive, sixty-four, or twenty-nine
per cent, defined the word correctly. This is practically the same
percentage as was obtained for the female group. Of the normal
boys who were eight to fourteen, inclusive, twenty-two per cent, or
thirty-three of one hundred fifty-two, defined “mellow” correctly,
which is within three per cent of the percentage obtained for the
female group. It is only at the age of thirteen that more than fifty
per cent of the normal boys define “mellow” correctly, and only at the
age of fourteen that more than sixty-six per cent do so.

Table VII shows the number and percentage of both sexes defining 
“mellow” according to the chronological age and diagnostic category. 
Only two of the one hundred fifty-one persons, four to eight years of 
age, inclusive, defined ‘‘mellow” correctly. The group consisted of 
two four-year old superior children, three superior and three normal 
children of five years of age, eight superior, eight normal and three dull 
normal children of six years of age, eight superior, thirty-three normal, 
six dull normal children, and one borderline child of seven years of age, 
and sixteen superior, forty-nine normal, five dull normal, five border- 
line children, and one feeble-minded child of eight years. One 
superior child of seven and one of eight years passed the test. Six 
per cent of the children of nine years of age passed the test, and these 
were children of superior intelligence. Thirty-eight normal, twenty 
dull normal, three borderline, and two feeble-minded children failed 
to pass the test. Thirteen per cent of those ten years of age defined
“mellow” correctly. Sixty-seven per cent of the children of superior 
intelligence and three per cent of the children of normal intelligence 
at this age-level gave correct meanings. Of the children who were 
eleven years of age eighteen per cent passed the test, in the following 
proportion: fifty per cent of those having superior intelligence, five per 
cent of those of normal intelligence, twenty per cent of those of dull 
normal intelligence, twenty-one per cent of those of borderline intel- 
ligence and none of the feeble-minded. Twenty-one per cent of the 
children twelve years of age, i.e., fifty per cent of those having superior
intelligence, thirty-five per cent of those having normal intelligence, 
five per cent of those having dull normal intelligence, none of those 
having borderline or feeble-minded intelligence, passed the test. In 
the thirteen-year old group forty per cent of the children defined the 
word correctly. These were distributed as follows: sixty-five per cent 
of the superior children, forty-seven per cent of the normal children, 
twenty-nine per cent of the dull normal children, twenty-nine per cent 
of the borderline children, and nine per cent of the feeble-minded. 
Fifty-four per cent of the fourteen-year old children gave correct
meanings to “mellow.” In this group eighty-six per cent of the 
superior children, fifty-seven per cent of the normal children, sixty-five 
per cent of the dull normal children, twenty-nine per cent of the 
borderline children and nine per cent of the feeble-minded defined the 
term correctly. Of the children who were fifteen years old, fifty per 
cent defined “mellow” correctly, in the following proportions in the
various diagnostic categories: sixty-seven per cent of the superior 
children, sixty-five per cent of the normal children, fifty-nine per cent 
of the dull normal children, twenty per cent of the borderline children, 
and eleven per cent of the feeble-minded. Of those persons who were 
sixteen years and older—six hundred thirteen—sixty-six per cent 
defined the word correctly. Ninety-one per cent of the superior 
adults, eighty per cent of the normal adults, sixty-nine per cent of the 
dull normal adults, sixty-two per cent of the borderline adults and 
twenty-four per cent of the feeble-minded passed the test. Forty- 
three per cent of the fourteen hundred thirty-one persons given the 
word, “mellow,” correctly defined it.

Of the three hundred eighty-six children whose ages were between
eight and fourteen, inclusive, and who have superior or normal
intelligence, only one hundred ten, or twenty-eight per cent, defined
“mellow” correctly. Two hundred and seventy-two of these were
normal, of whom fifty-six, or twenty-one per cent, defined the word
correctly. Of the whole group of fourteen hundred thirty-one persons
as well as of those of normal intelligence, the word is first passed by
more than fifty per cent at age fourteen, and by the sixty-six per cent
criterion of Binet at age fifteen.

We wish to mention two interesting points in the study. In 
Tables II, III and IV, in which we considered the relationship of a 
person’s vocabulary-age and the correctness of his response, we note 
that there is a gradual rise in the percentage of correct responses as 
we ascend from the group of feeble-minded persons to those who have 
superior intelligence. Within almost each vocabulary-age distribution, 
however, there is a tendency for the percentage of correct responses 
to rise as we descend from the group of superior persons to those of 
lower intelligence. Table IV, which gives a good illustration of this, 
reveals in the total that those of superior, normal, dull normal, border- 
line and feeble-minded intelligence reply correctly in the following 
proportions: fifty-seven per cent, forty-seven per cent, forty-two per 
cent, forty-one per cent and seventeen per cent, respectively. When 
we examine the vocabulary-age of fourteen, for instance, the percentage
of correct responses rises from eighty-one per cent of those having 
superior intelligence to eighty-two per cent of those having normal 
intelligence, ninety per cent of those having dull normal intelligence, 
and to ninety-one per cent of those having borderline intelligence. 
This seems to indicate that the element of age is a factor in influencing 
the probability of a correct response. In other words, the correct 
definition of “mellow” is the result of an ability which is dependent
not alone on intelligence but also on the educational and experiential
background of the individual attempting to define the word.

With very few exceptions the females defined “mellow” with a
higher percentage of correctness than did the males. It will be 
recalled that forty-eight per cent of the females passed the test whereas 
thirty-six per cent of males did. This may indicate that the females 
are more apt to come into contact with this word, and because of
their associations have a greater opportunity to know its meaning 
than do the males. 

An historical study of “mellow,” similar to the study made on
“shrewd” in a paper presented at the Psychology Section of the
Michigan Academy of Science, Arts and Letters in 1934,² indicated
that “mellow” has a meaning which has been quite uniform throughout
its history.*


CONCLUSIONS 


1. “Mellow” is too difficult for normal eight-year old children.

2. The word would more properly be placed between the twenty- 
fifth and thirty-third word in the vocabulary list, which would be 
normal for the average fourteen-year old child. This would advance 
the word six years beyond its present location in the list, and four years 
beyond the place recommended for it in a previous study made by 
Margaret V. Cobb.¹

3. The fact that the correct definition of the word “mellow” is
influenced by the chronological age of the individual, and by the sex to
which the person belongs, indicates that the word has greater diagnostic
value from an educational point of view than from an intelligence
point of view. Whether this factor should militate against
its usage in a vocabulary test which has diagnosis of intelligence for its
aim cannot be decided until other words in the vocabulary list have





* A New English Dictionary on Historical Principles. 
been treated in a way similar to that in which “mellow” has been
studied.

4. The one factor in favor of the continuance of “mellow” in the
vocabulary list is the uniformity of the history of its usage.


THE STANDARD ERROR OF A MEAN RANK ORDER


A. C. ROSANDER 


Research Fellow, General Education Board, Bronxville, N. Y. 


In social attitude scale construction, the use of the mean rank 
order to index a series of statements into a graded sequence gives rise 
to the problem of the standard error of this constant. 

Assume that the mean rank orders of $n$ elements are obtained from
two groups of judges. For Element 1 the first group of judges gives
a mean rank value of $x_1$ while the second group of judges returns some
different mean value, say $y_1$. If the true rank order of Element 1 is $z_1$
then the value $x_1$ will deviate from it by some quantity $e_1$, while the
value $y_1$ will deviate from it by some other magnitude $E_1$. This is
illustrated in the following form which is extended to include the
general case of the $n$th element.








Elements    Group I         Group II        True rank
1           $x_1 + e_1$     $y_1 + E_1$     $z_1$
2           $x_2 + e_2$     $y_2 + E_2$     $z_2$
3           $x_3 + e_3$     $y_3 + E_3$     $z_3$
n           $x_n + e_n$     $y_n + E_n$     $z_n$














On the basis of these assumptions, $z_1$ equals $(x_1 + e_1)$ but also $(y_1 + E_1)$;
similarly $z_2$ equals $(x_2 + e_2)$ but also $(y_2 + E_2)$. For the general case
$z_n$ equals $(x_n + e_n)$ as well as $(y_n + E_n)$. Since $z_n$ is equal to each of
these expressions, they are equal to each other, or $(x_n + e_n)$ is equal
to $(y_n + E_n)$. Changing the equality just given into a more useful
form we obtain

$$(x_n - y_n) = (E_n - e_n) \tag{1}$$

or generally

$$(x_i - y_i) = (E_i - e_i)$$

Squaring and summing both sides of equation (1) we get

$$\sum (x_i - y_i)^2 = \sum (E_i - e_i)^2 \tag{2}$$

Expanding equation (2) and dividing by $n$ elements we obtain

$$\frac{\sum x_i^2}{n} - \frac{2\sum x_i y_i}{n} + \frac{\sum y_i^2}{n} = \frac{\sum E_i^2}{n} - \frac{2\sum E_i e_i}{n} + \frac{\sum e_i^2}{n} \tag{3}$$











Considerable brevity may be obtained if we assume that the correlation
between $E_n$ and $e_n$ is zero, that the values of $\sum E_i^2$ and $\sum e_i^2$ are equal, that
the values of $\sum x_i^2$ and $\sum y_i^2$ are equal, and that $E$ and $e$ are normally
distributed. Making use of these assumptions we obtain the following
expression for the standard deviation of the errors about the true
rank orders.

$$\sigma_e^2 = \frac{\sum e_i^2}{n} = \frac{\sum x_i^2}{n} - \frac{\sum x_i y_i}{n} \tag{4}$$








The next step is to obtain appropriate equivalents for the two
right-hand members of equation (4). By the method of successive
intervals in higher algebra it can be shown that

$$\frac{\sum x_i^2}{n} = \frac{(2n + 1)(n + 1)}{6} \tag{5}$$





This expression is the mean of the squares of the first $n$
integers. Making use of the product moment formula for the
correlation coefficient, we obtain the following expression for the second term
of equation (4).

$$\frac{\sum x_i y_i}{n} = \rho_{xy}\,\frac{(n^2 - 1)}{12} + \frac{(n + 1)^2}{4} \tag{6}$$


where $\rho_{xy}$ is the rank order correlation between the two sets of mean
rank orders, x and y. Substituting (5) and (6) in (4) and simplifying
we obtain the following formula for the standard error of a mean rank
order.

$$\sigma_e = \sqrt{\frac{n^2 - 1}{12}\,(1 - \rho_{xy})} \tag{7}$$

Since it can be shown that the standard deviation of the first n integers
is $\sqrt{(n^2 - 1)/12}$, it is plain that equation (7) can be expressed in a
simpler form.

$$\sigma_e = \sigma_x \sqrt{1 - \rho_{xy}} \tag{8}$$


In equations (7) and (8), $\rho_{xy}$ stands for the rank order correlation
obtained from the two groups of judges, each group returning a mean
rank order for each of the n elements. The value of $\sigma_x$ is the standard
deviation of either group of mean rank orders; a more accurate value
would be the average of the deviations of the two groups of values.
It is apparent that this value $\sigma_x$ takes into consideration the
number of elements, whereas the value $\rho_{xy}$ takes into account the
number of judges used in obtaining the mean ranks. As a general
rule the more judges the higher will be the value of $\rho_{xy}$. On the other
hand the greater the number of elements to be ranked the greater
will be the standard deviation of the series of mean rank orders.
Empirical tests show that the assumptions made in the derivation
are quite justified. It was found that the summation of $E^2$ is fairly
close to the summation of $e^2$. In addition two correlations between
values of E and e, one based upon fifteen judges and the other on five
judges, showed that they do not differ significantly from zero.
Application of formula (7) to actual data gives the results shown in
Table I. In this table, N stands for the number of judges, n for the
number of elements which in this case are statements of opinion, $\rho_{xy}$
for the rank order correlation between mean ranks x and y, and $\sigma_e$ for
the standard error of a single mean rank order for any of the twenty-five
elements. This error amounts to about one-half of a rank order
in case of twenty-five judges, about one-third of a rank order in case
of fifty judges, and about one-eighth of a mean rank order for three
hundred judges. Since the true rank order will fall within three


TABLE I.—STANDARD ERROR OF A MEAN RANK ORDER 








N   | $\rho_{xy}$ | n  | $\sigma_e$
25  | .995        | 25 | .51
50  | .997        | 25 | .36
300 | .999        | 25 | .12














standard errors of the mean, the range of error is 1.5 rank orders for 
twenty-five judges, one rank order for fifty judges, and .37 rank order 
for three hundred judges. In terms of per cent of the total range of 


twenty-five ranks, these represent respectively a six per cent, a four per 
cent, and a 1.5 per cent error. 
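
As a rough check on the figures above, equation (7) can be evaluated directly. The following is a minimal sketch in Python; the function name `se_mean_rank` is an illustrative assumption and does not come from the paper. Because the tabled correlations are rounded to three decimals, only the case of twenty-five judges is reproduced here.

```python
import math


def se_mean_rank(n: int, rho: float) -> float:
    """Standard error of a mean rank order, following equation (7):
    sqrt((n^2 - 1) / 12 * (1 - rho)).

    n   -- number of ranked elements
    rho -- rank order correlation between the two sets of mean ranks
    """
    if n < 2:
        raise ValueError("need at least two ranked elements")
    return math.sqrt((n * n - 1) / 12.0 * (1.0 - rho))


# Twenty-five elements, rho = .995 (the twenty-five judge case).
se = se_mean_rank(25, 0.995)
print(round(se, 2))      # 0.51, about one-half of a rank order
print(round(3 * se, 1))  # 1.5, the three-standard-error range
```

Dividing the three-standard-error range by the twenty-five ranks gives the six per cent figure quoted in the text.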


THE RELATION BETWEEN THE MEAN DEVIATION $\epsilon$ AND THE
STANDARD ERROR $\sigma_e$


We have shown elsewhere that the rank order correlation coefficient
under certain conditions of ranking can be expressed as follows

$$\rho = 1 - \frac{6\epsilon^2}{n^2 - 1} \tag{9}$$

where $\epsilon$ is the mean deviation of a series of ranks from the true rank
order. If we solve this equation for $\epsilon$ we obtain an expression whose
similarity to that of $\sigma_e$ in equation (7) is readily apparent.

$$\epsilon = \sqrt{\frac{n^2 - 1}{6}\,(1 - \rho)} \tag{10}$$

Expressing this in terms of $\sigma_x$ we obtain the following equation:

$$\epsilon = \sigma_x \sqrt{2(1 - \rho)} \tag{11}$$

Dividing equation (11) by equation (8) we obtain at once the relation
between $\epsilon$ and $\sigma_e$.

$$\epsilon = \sigma_e \sqrt{2} \tag{12}$$


From Table I for twenty-five elements $\rho_{xy}$ is equal to .995, and $\sigma_e$
is .51. Therefore from equation (12) $\epsilon$ is equal to .72. Now applying
equation (9) and using twenty-five for n, we obtain for $\rho$ a value of
.9951 which checks the value given in the table. The other two
values in Table I also check. In other words in this particular
problem all the fundamental equations (7), (9), (10), and (12) are
quite consistent.
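
The consistency check just described can be sketched as a round trip. This is an illustration only: it assumes equation (9) has the form $\rho = 1 - 6\epsilon^2/(n^2 - 1)$, which is the form implied by equations (10) through (12), and the function names are hypothetical.

```python
import math


def se_from_rho(n: int, rho: float) -> float:
    # Equation (7): standard error of a mean rank order.
    return math.sqrt((n * n - 1) / 12.0 * (1.0 - rho))


def mean_deviation_from_se(se: float) -> float:
    # Equation (12): epsilon = sigma_e * sqrt(2).
    return se * math.sqrt(2.0)


def rho_from_mean_deviation(n: int, eps: float) -> float:
    # Equation (9), in the assumed form 1 - 6 * eps^2 / (n^2 - 1).
    return 1.0 - 6.0 * eps ** 2 / (n * n - 1)


n, rho = 25, 0.995
se = se_from_rho(n, rho)
eps = mean_deviation_from_se(se)
print(round(se, 2), round(eps, 2))  # 0.51 0.72

# Going back through equation (9) recovers the starting correlation.
print(abs(rho_from_mean_deviation(n, eps) - rho) < 1e-9)  # True
```

Carried at full precision the round trip returns the starting correlation exactly; small departures in the last decimal place, such as the .9951 above, come from rounding the intermediate values.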


CONCLUSION 


If the elements to be ranked are fairly evenly graded, and noticeably 
different, then the conditions assumed in the derivations will be 
approximated and the various formulae will give useful results. 
Actually we assume that the mean rank orders are such that they 
may be represented by a series of integral numbers from one to n; 
practically this condition is approached rather than reached. While 
these formulae have been derived in connection with the indexing of 
opinions in attitude scale construction, there is no reason why they 
may not be applied to any set of ranked data which meet the conditions 
herein assumed. 











THE RATING OF COLLEGE TEACHERS ON TEN 
TRAITS BY THEIR STUDENTS 


J. D. HEILMAN AND W. D. ARMENTROUT 
Colorado State College of Education 


During the spring quarter of 1935 the administration of Colorado 
State College of Education (CSCE) requested each member of its 
teaching staff who was teaching a class with an enrollment of twenty- 
five or more students to administer the Purdue Rating Scale for 
Instructors in one of these classes. The instructors were directed to 
administer the scale according to the directions printed on the scale. 
The scoring was done in one of the administrative offices. None of 
the students signed the copy of the scale which he used to rate his 
instructor. 

Ratings of forty-six teachers were made by fifty classes, one teacher 
having been rated by two classes and another by four. In four of the 
classes the number of students was below twenty-five. The size of 
the classes ranged from seventeen to one hundred and twenty-one. 
The average enrollment per class was forty-two. The total number of 
ratings made by the students was two thousand one hundred and 
fifteen. 

The scale provides for a graphic rating on each of ten traits, which
are regarded as important in the personality of a good teacher. The
list of traits is as follows:


1. Interest in subject.................................... IS
2. Sympathetic attitude toward students................... SA
3. Fairness in grading.................................... FG
4. Liberal and progressive attitude....................... LA
5. Presentation of subject-matter......................... PS
6. Sense of proportion and humor.......................... SH
7. Self-reliance and confidence........................... SR
8. Personal peculiarities................................. PP
9. Personal appearance.................................... PA
10. Stimulating intellectual curiosity.................... SC


The letters which appear after a trait in the above list are used to 
designate the trait in some of the tables and charts which appear in the 
article. The first trait in the list will be used to illustrate the nature 
of the graphic rating scale: 


Always appears full of his subject. Seems mildly interested. Subject seems irksome to him. 
INTEREST IN SUBJECT


For the purpose of rating each trait the scale provides three descrip- 
tive statements and one hundred points or divisions in the form of a 
graph. The instructions for administering the scale direct the student 
to rate the teacher on each trait by making a check mark at the point 
on the line which best describes the teacher with reference to the trait. 
At the bottom of the scale the student is directed to underline the 
phrase which best places the teacher in comparison with other teachers 
on a five-point or quintile rating. The following phrases are given: 


1. The highest fifth. 

2. Next to the highest fifth. 
3. The middle fifth. 

4. Next to the lowest fifth. 
5. The lowest fifth. 


According to its authors,¹ the purpose of the scale is to provide a
means for the student to record his judgment of the instructor on each 
of the ten traits of the scale. These judgments, regardless of their 
conformity to truer evaluations, the authors believe would be very 
helpful to the instructor who knew them. They regard these judg- 
ments or student attitudes toward their instructors as, next to the 
learner’s intelligence, probably the most important factors in the 
learning process. On account of the purpose of the scale as described, 
the authors are of the opinion that the reliability and the validity 
of the scale are identical. 

As four different kinds of means or ratings are employed in present- 
ing the results of this study, it was felt necessary for the purpose of 
avoiding confusion to designate these means by specific terms or 
phrases. The instructor’s standing on each of the ten traits of the
scale is expressed by the mean of the ratings of the different members 
of his class. This mean or rating has been named the (1) individual 
trait-mean or rating. There are five hundred of these means. Each 
instructor’s standing on the scale as a whole is stated in terms of the 
mean of his ten individual trait-means. This mean or rating has been 
designated the (2) individual scale-mean or rating. Because the 
instructors who were rated by more than one class were counted as so 
many different instructors in computing these means, there are a total 





1 Brandenburg, G. C. and Remmers, H. H.: Manual for the Purdue Rating 
Scale. Lafayette Printing Company, 1928. 
of fifty individual scale-means. The mean of the fifty means on each
trait also was computed. This mean is referred to as the (3) group 
trait-mean. Because there are ten different traits there are a total 
of ten group trait-means or ratings. The fourth mean is the mean of 
the fifty individual scale-means. There is only one mean or rating 
of this kind and the name (4) group scale-mean has been applied to it. 

For all of these means standard errors were computed. To find 
the standard errors of the individual trait-means the ordinary formula 
for finding the standard error of a mean was used. At the suggestion
of T. L. Kelley the standard error of each of the other kinds of means
was found by using the same formula, but with the SD of the formula
taken as the standard deviation of the distribution of means for which
a mean was found, and the N of the formula as the number of means in
that distribution, unless this number was ten or fewer, in which case
the number of means less one was used.
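
The rule for the standard error of a mean of means can be sketched as follows. This is a minimal illustration, not the authors' own computation; the function name is hypothetical, and the three checks use standard deviations and group sizes that the article reports for its division groups.

```python
import math


def standard_error_of_mean(sd: float, n_means: int) -> float:
    """SD / sqrt(N), with N replaced by N - 1 when N is ten or fewer,
    as described for the means of means in the text."""
    if n_means < 2:
        raise ValueError("need at least two means")
    divisor = n_means - 1 if n_means <= 10 else n_means
    return sd / math.sqrt(divisor)


# SD and number of instructors as reported for three division groups.
print(round(standard_error_of_mean(4.60, 2), 2))   # 4.6
print(round(standard_error_of_mean(5.22, 9), 2))   # 1.85
print(round(standard_error_of_mean(7.23, 10), 2))  # 2.41
```

With only two means in a group the divisor is one, so the standard error equals the standard deviation itself.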

In the frequency distributions of Table I the five hundred indi- 
vidual trait-means are reported as well as the fifty individual scale- 
means. At the foot of the table appear the ten group trait-means and 
the group scale-mean with their standard errors. The standard 
deviations of the frequency distributions of the individual trait-means 
and scale-means, and the extreme ranges of these distributions are 
also listed at the foot of Table I. 

The group trait-means vary from 86.96 for Personal Appearance to 
73.32 for the Presentation of Subject-matter. According to these 
means the students are much better satisfied with the appearance of 
their teachers, the interest they take in their subjects, and the fairness 
of their grading than they are with their personal peculiarities, the 
degree of intellectual curiosity aroused by them, and the excellence 
with which they present subject-matter. 

The extreme ranges, listed in the last two lines of the table, for 
each of the ten sets of fifty individual trait-means and for the fifty 
individual scale-means show how differently teachers are regarded by 
their classes on each of the ten traits and the average of these traits. 
The extreme ranges vary from thirty to forty points, excepting the one 
for the individual trait-means on the Presentation of Subject-matter 
which varies fifty points or one-half of the range of the entire scale. 
This variation in the individual trait-means for each trait is also shown 
by the standard deviations recorded in the third last line of the table. 
As the standard deviation of the individual trait-means for the 
Presentation of Subject-matter is much larger than the others, it is 
TABLE I.—DISTRIBUTIONS OF FIFTY INDIVIDUAL SCALE-MEANS, AND FIFTY
INDIVIDUAL TRAIT-MEANS FOR EACH OF TEN TRAITS; THE MEANS OF THESE
DISTRIBUTIONS WITH THEIR STANDARD ERRORS, AND THE STANDARD
DEVIATIONS AND EXTREME RANGES OF THE DISTRIBUTIONS

[Frequency distributions not recoverable.]

Means, with standard errors where legible, in the order printed:
86.96 (.99); 85.08 (.93); 82.64; 80.32 (1.21); 79.96; 78.96; 77.72;
74.64 (1.24); 74.60 (1.26); 73.32 (1.74); 79.84 (.93)

Standard deviation of individual trait-means: 6.56 (remaining values not recoverable)

probable that the teachers differ more widely on this trait or that
the students are more sensitive to differences in this than in the other 
traits. 

The ratings by the individual members of a class from which the 
individual trait means for each instructor were computed are not 
recorded in Table I, but they are very significant because they show 
how widely the students differ in rating a given teacher. As a measure
of the spread of the students’ ratings we are giving for each trait 
the smallest and the largest standard deviation of the fifty distributions 
of the student ratings on each of the ten traits: 





           | PA    | IS    | FG    | SR    | SA    | LA    | SH    | PP    | SC    | PS
Smallest σ | 4.25  | 6.85  | 7.55  | 7.07  | 6.90  | 9.05  | 7.19  | 8.40  | 5.75  | 4.75
Largest σ  | 18.45 | 24.70 | 24.50 | 24.55 | 23.15 | 23.95 | 21.25 | 22.50 | 28.54 | 27.30








Although the students show the most uniform agreement in their 
rating of Personal Appearance, which is the most objective of the ten 
traits, even on this trait the majority of the standard deviations 
appear to be exceedingly large when it is remembered that there are 
only one hundred points in the entire scale and as many as five to six 
standard deviations in most distributions. 

Moreover, there are instructors whose lowest ratings by any 
student on each of eight of the ten traits of the scale fell below five and
whose highest ratings on the same traits were above ninety-five, 
giving a range of ninety or more points in the ratings of a single 
teacher on single traits. On Fairness in Grading the lowest rating 
by any student was in the five to ten interval and on Sense of Propor- 
tion and Humor the lowest rating was in the fifteen to twenty interval. 
These results should serve as another warning to those who have any 
confidence in the accuracy of subjective methods of measurement 
when they are applied by one or even four or five persons. Because 
the professed value of the scale is to inform the instructor of the 
student’s evaluation of him on each of the ten traits, he may, after 
obtaining such a variety of opinion, be in a quandary with reference
to his behavior. If he made any changes in a trait, those who approved 
of it before the change might disapprove of it after the change and 
vice versa. 

However, the mean rating of the entire class may be used to offset 
somewhat this shortcoming of the individual ratings, but the more 
widely the ratings by the class are scattered about the mean, the less
reliable and valuable the mean becomes, because this scatter increases 
the size of the standard deviation which in turn increases the size 
of the standard error of the mean. How the uncertainty of the value 
of a true mean increases with the size of the standard error of the 
obtained mean will be explained in the second paragraph below. 

From an examination of the standard errors of the five hundred
individual trait-means, which are not recorded in Table I, it was 
found that some of these were quite large and that they varied con- 
siderably within each of the ten sets of fifty means. The smallest 
and the largest standard errors of the individual trait-means are 
given for each trait in the following tabulation: 





            | PA   | IS   | FG   | SR   | SA   | LA   | SH   | PP   | SC   | PS   | Means
Smallest σM | .55  | .71  | 1.15 | 1.09 | .98  | 1.38 | 1.20 | 1.44 | 1.16 | 1.09 | .48
Largest σM  | 2.93 | 4.24 | 5.35 | 5.01 | 4.27 | 4.42 | 3.90 | 4.40 | 4.90 | 4.54 | 3.95



































Assuming that a mean is not affected by a constant error, the true 
mean lies between the limits found by adding to and subtracting from 
the obtained mean three times its standard error. For example, 
one teacher received, for Personal Appearance, a rating of 94.65 with a 
standard error of .55. In this case one may be practically certain 
that the true rating for Personal Appearance lies between 93.00 and 
96.30. For Fairness in Grading, one teacher was given a rating of 
64.90 with a standard error of 5.35. The true mean rating in this 
case lies almost certainly between 48.85 and 80.95. As there is a 
very large difference between forty-nine and eighty-one, it can readily 
be seen how important it is to make use of standard errors in the 
interpretation of these means in order that injustice may not be done 
to individual teachers. Approximately one-fifth of the standard errors 
of the five hundred individual trait-means are larger than three points. 
As almost one-fifth of the standard errors on the individual scale-means 
are also larger than three, it follows that many of the ratings on the 
scale as a whole as represented by the individual scale-means are also 
subject to large errors. Moreover, the lowest ratings tend to be the 
least reliable because the standard errors have a marked tendency to 
increase as the individual trait- and scale-means decrease in size. 
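
The two worked cases above follow from adding and subtracting three standard errors. A minimal sketch, in which the function name `true_mean_limits` is an illustrative assumption:

```python
def true_mean_limits(mean: float, se: float, k: float = 3.0):
    """Limits within which the true mean is taken to lie:
    the obtained mean plus and minus k standard errors (k = 3 in the text)."""
    return mean - k * se, mean + k * se


# Personal Appearance: rating 94.65, standard error .55.
low, high = true_mean_limits(94.65, 0.55)
print(f"{low:.2f} to {high:.2f}")  # 93.00 to 96.30

# Fairness in Grading: rating 64.90, standard error 5.35.
low, high = true_mean_limits(64.90, 5.35)
print(f"{low:.2f} to {high:.2f}")  # 48.85 to 80.95
```

The second interval is roughly thirty-two points wide on a one hundred point scale, which is why the text stresses reading each mean together with its standard error.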

For the purpose of discovering how many true differences existed 
among the Purdue ratings, as many different pairs as possible were 
formed from the fifty individual scale-means, and the difference
between the members of each pair and the standard error of this 
difference were computed. A total of twelve hundred twenty-five 
differences with their standard errors were obtained. Of these differ- 
ences four hundred ten or about one-third of the total were found to be 
true differences. Although some of the remaining eight hundred 
fifteen differences may be reliable, no assurance of this fact was found 
in our data. 
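
The pairing procedure can be sketched in a few lines. Two points are assumptions rather than statements from this paragraph: the standard error of a difference is taken as the square root of the sum of the squared standard errors, and a difference is counted as reliable when it is at least 2.78 times its standard error, the criterion the article gives later. The example ratings are hypothetical.

```python
import math
from itertools import combinations

CRITICAL_RATIO = 2.78  # criterion stated later in the article


def count_reliable_differences(ratings, critical_ratio=CRITICAL_RATIO):
    """ratings: list of (mean, standard_error) tuples.
    Returns (number_of_pairs, number_of_reliable_differences)."""
    pairs = list(combinations(ratings, 2))
    reliable = 0
    for (mean_a, se_a), (mean_b, se_b) in pairs:
        # Assumed formula: sqrt(se_a^2 + se_b^2).
        se_diff = math.hypot(se_a, se_b)
        if se_diff > 0 and abs(mean_a - mean_b) / se_diff >= critical_ratio:
            reliable += 1
    return len(pairs), reliable


# Fifty ratings yield the number of pairs reported in the text.
print(math.comb(50, 2))                  # 1225
print(round(410 / math.comb(50, 2), 3))  # 0.335, about one-third

# Hypothetical ratings, for illustration only.
example = [(90.0, 1.0), (85.0, 1.0), (84.0, 1.0)]
print(count_reliable_differences(example))  # (3, 2)
```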

On the basis of these results each of the fifty individual scale-means 
or ratings was examined for the number of reliably higher and the 
number of reliably lower ratings. In Table II the key number of 
each rating with the number of higher and the number of lower ratings 
is given. The table may be read as follows: For the rating or instructor
23b, Key 1, there were found no higher but forty-eight lower ratings.
For rating thirty-six located in the first line of figures and in the second 
last column of the table, there were found four higher and four lower 
ratings. This table makes it possible to rank the teacher on the basis 
of two criteria—the number rated higher as well as the number rated 
lower than he. By this means it is possible to differentiate the relative 
standing of almost all of the teachers or ratings involved in the study. 
The relative standing of the ratings as determined by this method is 
believed to be somewhat superior to the standing based on the size 
of means which vary in reliability. However, in the first and fifth 
columns of Table II the key numbers of the ratings are arranged accord- 
ing to the size of the individual scale-means, the key number of the 
highest mean appearing first. 

Through the ratings of the Purdue Scale a comparison is made of 
the faculties of Colorado State College of Education and Purdue 
University. The data for the latter were obtained from a bulletin¹
published by the University. From this bulletin means were obtained 
for two hundred ninety-three ratings on each of the ten traits. These 
ratings were made on one hundred fifteen teachers. A total of 
two hundred twelve teachers administered the scale, but, as the return 
of the results was made voluntary, only one hundred fifteen returns 
were received. At Colorado State College of Education (CSCE) all 
of the teachers who taught classes with enrollments of twenty-five 
or more students in any one class were asked to administer the scale 
and return the results. This difference in the way in which the 





1Remmers, H. H.: The College Professor as the Student Sees Him. Purdue 
University, Lafayette, Indiana, 1929. 
results were obtained may be a disturbing factor in the comparability
of the results for the two institutions. 


TABLE II.—NUMBER OF INDIVIDUAL SCALE-MEANS WHICH ARE RELIABLY HIGHER
AND THE NUMBER WHICH ARE RELIABLY LOWER THAN THE INDIVIDUAL
SCALE-MEAN OF EACH OF THE FIFTY RATINGS
Number of Reliable Differences

Key 2 | Higher | Key 1 | Lower | Key 2 | Higher | Key 1 | Lower
23b   | 0      | 23b   | 48    | 32    | 4      | 36    | 4
9     | 0      | 9     | 47    | 36    | 4      | 10    | 4
34    | 1      | 34    | 36    | 10    | 4      | 45    | 4
44    | 2      | 44    | 28    | 45    | 5      | 32    | 4
12    | 2      | 12    | 20    | 42    | 4      | 42    | 1
1d    | 2      | 28    | 16    | 8     | 4      | 8     | 1
28    | 2      | 3     | 16    | 20    | 4      | 20    | 1
3     | 2      | 25    | 16    | 40    | 5      | 40    | 1
25    | 2      | 16    | 16    | 26    | 5      | 26    | 1
17    | 2      | 17    | 14    | 7     | 10     | 1c    | 1
16    | 2      | 23a   | 14    | 1c    | 10     | 39    | 0
4     | 3      | 1a    | 15    | 33    | 11     | 2     | 0
23a   | 2      | 1d    | 13    | 2     | 14     | 7     | 1
1a    | 2      | 4     | 13    | 22    | 14     | 22    | 0
6     | 2      | 6     | 11    | 43    | 15     | 43    | 0
46    | 3      | 46    | 9     | 39    | 15     | 30    | 0
21    | 3      | 21    | 9     | 13    | 17     | 15    | 0
31    | 3      | 31    | 8     | 18    | 19     | 18    | 0
37    | 3      | 37    | 7     | 30    | 20     | 33    | 1
1b    | 4      | 41    | 8     | 15    | 21     | 13    | 0
41    | 3      | 1b    | 4     | 19    | 21     | 19    | 0
29    | 4      | 29    | 5     | 38    | 29     | 38    | 0
14    | 3      | 14    | 4     | 35    | 29     | 35    | 0
11    | 3      | 11    | 4     | 24    | 29     | 24    | 0
27    | 5      | 27    | 5     | 5     | 37     | 5     | 0
Total higher: 410 | Total lower: 410


























Notes.—The number of reliable differences between the members of each pair 
out of a possible twelve hundred twenty-five pairs, which may be formed out of the 
fifty scale-means, is four hundred ten. 

In the key 1 columns the order of the key numbers was determined by the 
number of reliably higher and lower means, and in the key 2 columns their order 
was determined by the relative size of the obtained means, the number for the 
higher mean appearing first. 


At the Colorado institution the individual trait-means for each
teacher were computed on the assumption of positive and negative
errors in the ratings. No statement is made of the method of computing
these means for the Purdue data. If positive and negative
errors in the ratings were not assumed in computing the Purdue means,
these means would be five-tenths of a point higher than if this assumption
had been made.

Chart I.—Curves based on the group trait-means and group scale-means of one
hundred fifteen Purdue University instructors, fifty Colorado State College of Education
instructors rated in 1935 and twenty-five CSCE instructors rated from five to
seven years earlier.

The curves of Chart I represent the means of the two hundred 
ninety-three Purdue and fifty CSCE ratings on each of the ten traits, 
or the group trait-means, as well as the group scale-means. The solid 
line curve represents the means for the Colorado institution and the
broken line curve those for Purdue University. Only at two points
does the broken line lie an appreciable distance above the solid line,
those over PP and PS. Only one of the eleven differences, that for
PA, is large enough to be statistically reliable. The dotted line 
represents the means for a group of twenty-five teachers who adminis- 
tered the scale at the Colorado institution in 1927 and 1928. This 
line lies below the solid line throughout its entire extent and below the 
broken line for all but four of the ten traits. None of the differences 
shown by the broken and dotted lines are statistically reliable. It is 
worthy of note that the three lines are practically parallel for almost 
their entire lengths. From the means represented by the curves of 
Chart I and the size of their standard errors it is impossible to say 
with certainty that the ratings of the CSCE faculty are any higher 
than those of the Purdue faculty. Accepting the 1935 CSCE ratings 
as more reliable than the 1927-1928 ratings, because of differences in 
the number of ratings and the methods of sampling the faculty, the 
chances of a higher rating are in favor of CSCE. Had the methods of 
sampling the faculties in the two institutions conformed more nearly 
to the statistical requirements of sampling, the results would be more 
meaningful. From our data we are unable to say whether a faculty 
whose main function is the preparation of teachers will be rated any 
higher on the Purdue Scale than a faculty whose main functions are 
instruction and research. 

As twenty-three of the CSCE faculty were rated in both 1935 and 
five to seven years earlier a comparison of the two sets of ratings is 
made possible. This comparison is expressed by the curves in 
Chart II. The ratings for 1935 are represented by the solid line 
curve and the earlier ratings by the broken line curve. The one 
hundred point scale appears at the left of the chart and the different 
teachers are designated by the numbers along the base line. For 
teacher twenty-three the individual scale-mean was eighty-nine in the 
earlier ratings and eighty-four in 1935. The other points in the curves 
are to be interpreted in a similar manner. For six of the twenty-three 
teachers the differences between the two sets of means are quite large, 
being over five points. However, only one of the differences between 
the scale-means for the individual teachers is large enough to meet 
the requirement for reliability, the difference of 12.20 points with a 
standard error of 4.33. This difference represents the amount which 
the earlier rating exceeded the 1935 rating. Nine of the twenty-three 
ratings were higher and fourteen were lower in the 1935 than in the
1927 to 1930 ratings. It is, however, probable that if teachers were 
rated by their students at frequent intervals improvements would be 
made, because the expectation of the rating or examination would 
act as a stimulus to greater effort on the part of the teacher. The 
Chart II.—Curves based on the individual scale-means of twenty-three Colorado
State College of Education instructors rated in 1935 and on the corresponding means of
these instructors rated five to seven years earlier.


coefficient of correlation between the two sets of means is .69 with a 
standard error of .110. If teachers do not change very much in the 
traits of the Purdue Scale over a period of seven or eight years the 
coefficient of .69 might be accepted as a coefficient of reliability of the 
scale. Evidence for the fact that the traits remain fairly constant 
over a period of years is furnished by the means of the two sets of
ratings. The broken horizontal line of the chart represents the
group scale-mean for the earlier set of ratings, and the solid horizontal 
line the corresponding mean for the 1935 ratings. The latter mean 
is 1.32 points lower than the former, a difference which is not statisti- 
cally reliable. 

For the purpose of discovering whether there were any true differ- 
ences between the ratings of the instructors who taught in the different 
divisions of the college, the ratings of the forty-six teachers were 
divided into eight groups on the basis of the divisions to which the 
teachers belonged. There are only seven divisions in the college, 
but the health and physical education division was sub-divided on the 
basis of sex. A list of these divisions, the number of instructors in 
each one who were also subjects of the study, the scale-mean for 
each division, the standard deviations of the individual scale-means, 
and the standard errors of the scale-means of each division are reported 
in Table III. The means vary in size from 75.2 for the division of 
fine and industrial arts to 86.5 for the division of music, a difference 
of 11.3 points on a one hundred point scale. However, the standard 
errors of the differences are so large that none of the obtained differ- 


TABLE III.—GROUP SCALE-MEANS, WITH THEIR STANDARD ERRORS, MADE ON THE
PURDUE RATING SCALE BY THE INSTRUCTORS IN THE SEVEN DIVISIONS OF
COLORADO STATE COLLEGE OF EDUCATION, SPRING QUARTER—1935
(Division of Health and Physical Education Has Been Subdivided)

Division                              | Number of instructors | Mean | SD   | σM
Music                                 | 2                     | 86.5 | 4.60 | 4.60
Education                             | 9                     | 81.4 | 5.22 | 1.85
Literature and languages              | 6                     | 80.7 | 4.45 | 1.99
Health and physical education (women) | 2                     | 80.3 | 3.75 | 3.75
[illegible]                           | 10                    | 78.3 | 7.23 | 2.41
[illegible]                           | 9                     | 77.1 | 5.76 | 2.04
Health and physical education (men)   | 3                     | 75.7 | 3.87 | 2.74
Fine and industrial arts              | 5                     | 75.2 | 8.49 | 4.25
Total                                 | 46                    |      |      |

















ences are statistically reliable. One is therefore unable to say from 
these results that the kind of subject-matter taught by the instructor 
has any effect upon his Purdue ratings. 
It might be expected that the instructors in the division of education,
who are especially concerned with such matters as the best
methods of teaching, arousing the student’s interest, and adapting the
material to the capacity of the learner, would be rated higher by their
students on the Purdue Scale than the instructors of the other divisions.
Therefore a comparison of the ratings of the nine teachers of the
division of education was made with the ratings of the forty-one instructors
of the other divisions. The group scale-mean for the instructors
of the division of education was found to be 81.47 ± 1.85 SE, and that
for the instructors of the other divisions 78.95 ± 1.05 SE. The difference
between the two means is 2.52 with a standard error of 2.13.
To be statistically reliable, that is, to enable one to say with practical
certainty that there is a true difference, the obtained difference should be
at least 2.78 times the standard error of the difference. Even though the difference
were statistically reliable it would be too small to make its practical 
significance of much value. Moreover, as most of the teachers in 
the division of education do not carry a full teaching load on account 
of administrative duties they are not obliged to keep up on as large a 
variety of courses as the instructors of the other divisions. 
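
The comparison in this paragraph can be reproduced in a short sketch. The formula for the standard error of a difference, the square root of the sum of the squared standard errors, is an assumption here; the article does not state it, but it yields the reported value of 2.13. The function name is hypothetical.

```python
import math


def se_of_difference(se_a: float, se_b: float) -> float:
    # Assumed formula: sqrt(se_a^2 + se_b^2).
    return math.hypot(se_a, se_b)


education_mean, education_se = 81.47, 1.85
others_mean, others_se = 78.95, 1.05

diff = education_mean - others_mean
se_diff = se_of_difference(education_se, others_se)
ratio = diff / se_diff

print(round(diff, 2), round(se_diff, 2), round(ratio, 2))  # 2.52 2.13 1.18
print(ratio >= 2.78)  # False: the difference is not statistically reliable
```

With a ratio of about 1.18 the difference falls well short of the 2.78 criterion, in line with the conclusion drawn in the text.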

In order to determine whether the ratings bore any relation to the 
amount of teaching experience, they were grouped for the teachers with 
the following number of years of teaching experience: seven to twelve, 
ten teachers; twelve to seventeen, thirteen teachers; seventeen to 
twenty-seven, eleven teachers; and twenty-seven and more years, 
twelve teachers. The scale-means for these groups are 79.80, 79.92, 
79.55, and 77.33, respectively. The largest difference, 2.59, occurs 
between the means of the second and last experience groups. This 
difference is far too small to be statistically reliable or have much 
practical significance. These results support previous findings that 
teachers do not improve after they have had several years of teaching 
experience. They also show that there is no decline in their teaching, 
as estimated by their students on the Purdue Scale, after many years 
of experience. 

With the present tendency to retire teachers at a relatively early 
age it was thought desirable, in spite of small numbers, to compare the 
students’ ratings of the teachers falling into different age groups. The 
following age groups were formed: Thirty to thirty-five, seven teachers; 
thirty-five to forty, nine teachers; forty to forty-five, ten teachers; 
forty-five to fifty, five teachers; fifty to fifty-five, five teachers;
fifty-five to sixty, seven teachers; sixty and above, three teachers. The
scale-means for these groups are 80.89, 79.20, 80.28, 77.50, 79.26,
76.80, and 78.22, respectively. The lowest mean occurs in the age 
group fifty-five to sixty, and the next lowest mean in the age group 
forty-five to fifty. Even the largest difference, 4.09, which occurs
between the first and the second-to-last age groups, is much too small to
meet the requirement of statistical reliability. There is no certain 
evidence in these results to support the belief that the older teachers 
are rated lower by their students than the younger ones. 

It is undoubtedly true that there are many factors other than the 
traits rated which either raise or depress the students’ ratings, or 
affect the expression of the traits favorably or unfavorably. Among 
these have been mentioned the size of the class, the maturity and sex 
of the rater, the degree of severity used in grading students, the 
difficulty of the course, the ability of the students, and the number of 
different courses taught by the teacher. Some of these factors for 
which data were available have been investigated in this study and 
will be discussed in the following pages. The first one to be considered 
is class size. 

The size of the fifty classes which made the ratings varied from 
seventeen to one hundred and twenty-one. The enrollment in the 
five smallest classes was twenty-five or fewer students and in the five 
largest classes it was fifty-six or more. With such a range in class 
size the data should be adequate to determine whether size has any 
material influence on the ratings or the expression of the Purdue traits 
by the teacher. The correlation method was used to measure the 
effect of size. A significant negative correlation between the size 
of the classes and the excellence of the ratings would indicate that 
large classes depressed the ratings. The product-moment coefficient 
for class size and the individual scale-means was found to be positive,
.236 ± .133 SE. According to these results one can not be certain
that the size of the class making the ratings has an influence upon the 
quality of the ratings, but the chances are slightly in favor of an 
increase of the rating with the size of the class. 
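
The text does not state how the standard error of the coefficient was computed. As a rough check, the sketch below assumes the classical large-sample formula, SE = (1 - r²)/√N, with N taken as the fifty class ratings; both the formula and the choice of N are assumptions of this illustration, not statements from the study.

```python
import math

def se_of_r(r: float, n: int) -> float:
    """Large-sample standard error of a product-moment coefficient (assumed formula)."""
    return (1 - r ** 2) / math.sqrt(n)

r_class_size = 0.236  # coefficient reported in the text
n_classes = 50        # number of class ratings (assumed to be the N used)

se = se_of_r(r_class_size, n_classes)
ratio = r_class_size / se  # coefficient expressed in units of its standard error

print(f"SE = {se:.3f}, r / SE = {ratio:.2f}")
```

On these assumptions the standard error comes out close to the reported .133, and the coefficient is less than twice its standard error, which is why no certain conclusion is drawn.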

From our data it was also possible to determine whether the 
instructor who gave high or the one who gave low grades received the 
higher rating. The severity of the grading for each of the forty-six 
teachers was obtained by computing the mean of all of the grades 
assigned by each of them during the three quarters of 1934-1935. 
The lowest of these means is 2.75 and the highest is 3.75. There are 
five points in the grading system for which three is the average. These 
means were then correlated with the teachers’ individual scale-means.
As the coefficient found was minus .042 with a standard error of .141, 
it is to be inferred from our data that there is no connection between 
the severity of grading and the Purdue ratings. 

The means for the grades were also correlated with the individual 
trait-means for Fairness in Grading. A coefficient of minus .236 with 
a standard error of .133 was obtained. The size of this coefficient 
with relation to its standard error is too small to enable one to say 
with certainty whether there is a true relationship between the two 
items, although the chances are slightly in favor of a negative
relationship. Those who assign the lower grades tend to be rated the higher
on the trait Fairness in Grading. 

Twenty-five of the forty-six instructors who were rated by their 
classes answered the questions of the Bernreuter Personality Inventory 
in order to determine whether the traits measured by this Inventory 
had any relation to the Purdue ratings. The scores made on it by 
the instructors were correlated with their individual scale-means 
on the rating scale. The correlations of the Purdue ratings with each 
one of the traits measured by the Personality Inventory are as follows: 
With neurotic tendency, minus .188 ± .135 SE; with self-sufficiency,
minus .094 ± .208 SE; with introversion and extroversion, minus
.105 ± .187 SE; and with dominance and submission, plus .209 ±
.117 SE. These coefficients are too low to indicate any significant
relationship between the traits correlated. They are very similar to 
those obtained for student-teachers whose teaching qualifications were 
rated by their critic teachers. 

To discover whether the Purdue ratings were affected by the 
maturity of the student, the mean of the ratings by the classes of 
the Senior College was compared with the mean of the ratings by the 
classes of the Junior College. Thirty-three Junior- and seventeen 
Senior-College classes participated in the ratings. The mean of the 
ratings by the Senior-College classes is 4.71 points higher than that 
by the Junior-College classes. But as the standard error of this 
difference is 1.83 the difference is not statistically reliable. The 
chances, however, are 99.5 in 100 that the Senior-College student 
makes the higher rating, assuming that there is no real difference in 
the traits between the group of instructors rated by the Senior classes 
and the group rated by the Junior classes. 

It has been suggested that the higher the student’s interest in the 
course the higher he would tend to rate the instructor. For the 
purpose of testing this assumption, the mean of the ratings made
by the classes taking required courses was compared with the mean 
of the classes taking elective courses. Of the fifty classes taking part 
in the ratings, sixteen were taking required courses and thirty-four 
elective courses. The mean of the ratings by the former is .49 points 
higher than that by the latter. The standard error of this differ- 
ence is 1.66. These results do not, therefore, support the contention 
that the ratings are raised by the degree of interest in the course as 
measured by required and elective courses. 

Because the women are often reputed to be better teachers than 
the men, a comparison was made between the mean ratings of the 
thirty-one men and the fifteen women who were rated in this investiga- 
tion. The mean of the ratings for the women is only 2.02 points higher 
than that for the men. As the standard error of the difference is 2.09, 
the chances are only eighty-three in one hundred that the women are 
rated higher than the men by their students. Moreover, the practical 
significance of a difference of two points is probably negligible. 
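
The "chances in one hundred" quoted for the men and women, and for the Senior- and Junior-College classes earlier, can be read as one-tailed probabilities under a normal sampling distribution. That reading is an assumption of the sketch below, since the text does not name the method; the function name is illustrative.

```python
import math

def chances_in_100(difference: float, se_difference: float) -> float:
    """Chances in 100 that the true difference has the same sign as the
    obtained one, assuming a normal sampling distribution (one-tailed)."""
    z = difference / se_difference
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Differences and standard errors reported in the text.
women_vs_men = chances_in_100(2.02, 2.09)
senior_vs_junior = chances_in_100(4.71, 1.83)

print(f"women rated higher than men:      {women_vs_men:.1f} in 100")
print(f"Senior classes rate higher:       {senior_vs_junior:.1f} in 100")
```

Read this way, the two results agree closely with the eighty-three and 99.5 chances in one hundred given in the text.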

An indication of the reliability of the Purdue Scale may be obtained 
from the degree of relation between the individual scale-means and the 
means of the quintile placements for the different instructors. A 
product-moment coefficient of correlation between these two values 
for the fifty instructors of this study was computed, counting the 
teachers who were rated by more than one class as so many different 
instructors. The size of this coefficient is .75 ± .06 SE. With the
same procedure Remmers¹ obtained a coefficient of .74. Remmers
used the ratings of one hundred fifteen teachers rated by as many
different classes. Brandenburg and Remmers² had each of three
teachers rated twice by the same class of thirty to thirty-three students. 
The second rating occurred after an interval of three days. The three 
coefficients of .83, .64, and .50, which were obtained, yield an average 
of .66. According to the Spearman-Brown formula this coefficient 
would have been about .75 for class sizes of some forty students, the 
average number in the CSCE classes. All of these coefficients are in 
close agreement not only with one another but with the coefficient of 
.69 which was obtained by correlating the 1935 ratings of twenty-three 
CSCE teachers with their ratings obtained five to seven years earlier. 
In making a comparison of these coefficients it was recognized that 
the coefficient between two sets of means is equal to the coefficient 





1 Remmers, H. H.: Op. cit. 
2 Brandenburg, G. C. and Remmers, H. H.: Op. cit. 
between the measures from which the means were computed.¹ From
these coefficients, it may be inferred that the coefficient of reliability 


TABLE IV.—INTERCORRELATIONS OF THE TEN TRAITS TESTED BY THE PURDUE
RATING SCALE AS ADMINISTERED TO THE CLASSES OF FORTY-SIX INSTRUCTORS
AT COLORADO STATE COLLEGE OF EDUCATION, SPRING QUARTER, 1935.
TWO INSTRUCTORS WERE RATED BY MORE THAN ONE CLASS MAKING
A TOTAL OF FIFTY CASES
(The Measures Given Are “r” and “PE”)

Trait                                    |  IS  |  FG  |  SR  |  SA  |  LA  |  SH  |  PP  |  SC  |  PS  | Mean¹
 1. Personal appearance                  | .110 | .290 | .419 | .057 | .256 | .321 | .691 | .372 | .356 | .320
                                         |±.094 |±.087 |±.079 |±.095 |±.089 |±.086 |±.050 |±.082 |±.083 |
 2. Interest in subject                  |      | .402 | .799 | .657 | .510 | .696 | .384 | .792 | .732 | .629
                                         |      |±.080 |±.034 |±.054 |±.070 |±.049 |±.082 |±.034 |±.043 |
 3. Fairness in grading                  |      |      | .586 | .482 | .564 | .593 | .533 | .473 | .505 | .517
                                         |      |      |±.063 |±.073 |±.065 |±.062 |±.069 |±.073 |±.071 |
 4. Self-reliance and confidence         |      |      |      | .494 | .530 | .785 | .666 | .827 | .850 | .692
                                         |      |      |      |±.072 |±.069 |±.037 |±.053 |±.031 |±.026 |
 5. Sympathetic attitude toward students |      |      |      |      | .684 | .782 | .359 | .489 | .452 | .550
                                         |      |      |      |      |±.051 |±.037 |±.083 |±.073 |±.076 |
 6. Liberal and progressive attitude     |      |      |      |      |      | [?]  | [?]  | [?]  | [?]  | [?]
                                         |      |      |      |      |      |±.035 |±.069 |±.071 |±.068 |
 7. Sense of proportion and humor        |      |      |      |      |      |      | .610 | .699 | .718 | .710
                                         |      |      |      |      |      |      | [?]  | [?]  | [?]  |
 8. Personal peculiarities               |      |      |      |      |      |      |      | .618 | .595 | .537
                                         |      |      |      |      |      |      |      |±.059 |±.061 |
 9. Stimulating intellectual curiosity   |      |      |      |      |      |      |      |      | .870 | .659
                                         |      |      |      |      |      |      |      |      |±.023 |
10. Presentation of subject-matter       |      |      |      |      |      |      |      |      |      | .657
    General average                      |      |      |      |      |      |      |      |      |      | .615

[?] Entry illegible in the source copy.

1 The first mean is the average of the correlations of Personal Appearance with every other trait; the remaining means
are the averages of the correlations of the given trait with each of the other traits, exclusive of Personal Appearance.
The general average does not include correlations involving Personal Appearance.

for the Purdue Scale, when used by classes of some forty students,
is about .75.

One factor which vitiates the accuracy of the rating of a specific 
trait is the rater’s general impression of the standing of the individual 





1 Kelley, T. L.: Statistical Method. Macmillan Company, p. 178.
as a whole. A student who has a high general impression of an
instructor tends to rate him higher on each specific trait than he would 
if his general impression were low. This is known as the “halo effect.” 
Among the factors which might give the student a high general opinion 
of his instructor are such items as his official position on the faculty, 
his public addresses, and the quality and number of his publications 
or any other outstanding desirable trait or qualification. As the 
“halo effect” operates more effectively on those specific traits which
are least objective, some idea of the “halo” for these traits may be 
gained by comparing their intercorrelations with their correlations 
with an objective trait. Thus the average of the coefficients for the 
relatively objective trait of Personal Appearance with the nine other 
more subjective traits, is approximately only half as large as the 
average of the intercorrelations of the nine other traits. These results 
appear in the last column of Table IV. In this table may also be 
found the forty-five coefficients obtained by correlating each one of the 
ten traits of the Purdue Scale with each one of the nine other traits. 
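
The statement that Personal Appearance correlates only about half as highly with the other traits as they do with one another can be checked from the first row of Table IV. The values below are read from the printed table, which is poorly legible in places, so the sketch is an illustration rather than a verified recomputation.

```python
# Correlations of Personal Appearance with the nine other traits (Table IV, row 1).
personal_appearance = [0.110, 0.290, 0.419, 0.057, 0.256, 0.321, 0.691, 0.372, 0.356]

# General average of the intercorrelations among the nine other traits, as printed.
general_average = 0.615

mean_pa = sum(personal_appearance) / len(personal_appearance)
ratio = mean_pa / general_average

print(f"mean correlation with Personal Appearance = {mean_pa:.3f}")
print(f"ratio to the general average              = {ratio:.2f}")
```

The mean agrees with the .320 printed in the last column, and the ratio is close to one-half.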

The average of these intercorrelations, exclusive of Personal 
Appearance, is .62. This coefficient is more than twice as large as 
similar coefficients given by Brandenburg and Remmers,¹ and considerably
larger than .37, the coefficient given by Stalnaker and
Remmers² as a measure of the “halo effect.” The coefficients of
Table IV indicate that in general the “halo effect” of the ratings for 
the subjective traits is at least three-tenths. 


SUMMARY 


The results of this study were obtained from the fifty 1935 class- 
ratings of forty-six teachers and from the twenty-five 1927-1930 
class-ratings of twenty-five teachers of the teaching staff of the 
Colorado State College of Education. The ratings were made by the 
Purdue Rating Scale for Instructors. The number of students who 
made the 1935 ratings was two thousand one hundred and fifteen. 

The students of any class vary very widely in the rating of any 
individual instructor on each of the ten traits measured by the Purdue 
Scale. The standard deviations for the distributions of the ratings 





1 Brandenburg, G. C. and Remmers, H. H.: Op. cit.

2 Stalnaker, J. M. and Remmers, H. H.: “Can Students Discriminate Traits
Associated with Success in Teaching?” Journal of Applied Psychology, December, 
1928. 
made by the different classes on the trait of Personal Appearance for
which they were the smallest vary from 4.25 to 18.45; and for the trait 
of Presentation of Subject-matter for which they were the largest 
they vary from 4.75 to 27.30. Not only do the students differ widely 
in rating a single teacher, but the teachers are rated very differently 
by their classes on each trait as well as on the scale as a whole. The 
standard deviation on the fifty trait-means for “Interest in Subject”
is 6.56; the one for “Presentation of Subject-matter” is 12.33; and the
standard deviation for the fifty individual scale-means is 6.54. 

Teachers are rated much higher in some traits than in others. One 
teacher was rated ninety-two on one trait and only fifty-nine on 
another. Moreover, the entire group of forty-six teachers is rated 
much higher on some than on other traits, the difference between the 
highest and the lowest of these ratings being approximately fourteen 
points. 

The standard errors on many of the means are so large that a 
teacher’s true individual trait-mean or even scale-mean may differ 
very widely from his obtained mean. 

Although the individual scale-means, or standings on the scale as a 
whole, for many of the teachers are considerably larger than for those of 
others, only about one-third of the differences are large enough to be 
statistically reliable. 

Only on one of the traits, that of Personal Appearance, does the 
group of teachers from the staff of the Colorado State College of 
Education rank reliably higher than a group from the staff of Purdue 
University. 

The group of twenty-three teachers which was rated in both 1935 
and five to seven years earlier showed a slight decrease in the later 
over the earlier ratings. 

No reliable differences were found between the ratings of the seven 
different divisions of the college, nor between the rating for the 
division of education and all of the other divisions combined. More- 
over, no reliable differences were obtained among the ratings of those 
groups of teachers who differed from five to twenty or more years in 
their teaching experience, and none among the ratings of the groups 
who differed from five to thirty or more years in age. 

The factors of class size, severity of grading, the traits measured 
by the Bernreuter Personality Inventory, the student’s interest in the 
course, the sex of the teacher, and the maturity of the rater, as evi- 
denced by our data, can not be said with certainty to have any effect 
upon the ratings; although in one case, the maturity of the rater, the
chances are very high that the more mature student rates the higher. 

The reliability of the scale in the hands of college students is 
probably about .75. The degree of “halo effect” for all but the trait of
Personal Appearance may be expressed by a coefficient of at least 
three-tenths. 
AN INVESTIGATION OF THE RELATION EXISTING
BETWEEN STUDENTS’ GRADES AND THEIR 
RATINGS OF THE INSTRUCTOR’S ABILITY TO TEACH 


MILTON L. BLUM 
College of the City of New York 


In the extensive bibliography on ratings or estimations, one fails 
to find a reference to the problem: Are students influenced by their 
standing in the course in rating instructors? This paper will endeavor 
to solve this question as well as other concomitant ones. 

During the summer session of the College of the City of New 
York, classes are in session for eight weeks. While this is exactly 
one-half the length of time of a regular session, the hours per week are 
doubled and, therefore, the number of hours of teaching is the same.
The study was conducted in two classes, one in general psychology 
and the other in industrial psychology. The classes were given three 
examinations during the session and a final examination at the close 
of the semester. At the first session the classes were informed that 
each examination would receive a weight of twenty-five per cent in 
the determination of the final grade. 

At the session before the final examination each class was given 
the following information: “You will receive a small sheet of paper.
You are to rate the instructor either ‘A,’ ‘B,’ ‘C,’ ‘D,’ ‘E,’ or ‘F,’ 
according to your estimation of his ability to teach the course. Con- 
sider the average instructor as obtaining the grade C. As you are all 
upper classmen you undoubtedly have in mind some instructors that 
are average. If you believe that your instructor is average then give 
him a grade of C. If above average, depending upon whether he is 
good or excellent, grade him B or A. If he is below average grade 
him D, E, or F depending entirely upon your opinion of him as com- 
pared with others similar to him who would be graded D, (just passing), 
E, (poor), F, (very poor). Do not sign your name to this sheet. 
Fold it and place it in the envelope provided.” A member of the
class sealed the envelope. After this was completed additional 
paper was distributed. On this sheet each student was asked to 
sign his name, estimate his own grade in the course, and also rank 
the instructor. Both classes were given the assurance that the data 
would be in sealed envelopes and would not be available to the instruc- 
tor until all actual grades had been determined and turned over 
to the proper administrative authorities. The experimenter felt
that this precaution was necessary so that the students would act as 
honest, fearless subjects. 

The third part of the experimental procedure took place at the 
final examination. Each student was asked to write the grade he 
anticipated, after having fulfilled all the requirements of the course. 

The reason for having each student rate the instructor twice is
obvious. Signed ratings may not be an honest expression of the 
student’s opinions. Unsigned ratings are likely to be more honest. 
However, by comparing the distributions of the signed and unsigned 
grades, if no difference or very little difference is found to exist, the 
estimations can be considered reliable and the statistical investigation 
can take place. What was actually found is depicted in Table I. 


TABLE I.—STUDENT RATINGS OF INSTRUCTOR

          General psychology      Industrial psychology
Grade     Signed    Unsigned      Signed    Unsigned
  A         11         10            5          4
  B         16         16           21         22
  C          2          3            2          2

According to the unsigned ratings, five of the total number rated 
the instructor as average. The remainder rated him either B or A. 


TABLE II.—COMBINED COMPARISON OF ACTUAL GRADES AND ESTIMATED GRADES
BEFORE FINAL EXAMINATION

             Estimated grade before final examinations
Grade          A       B       C       D
  A            7       2
  B            4      19       1
  C                    7      10
  D                    1       5       1

No estimations of D, E, or F were made. By examining the data 
it can be seen that only two students raised the grades upon signing. 
While it cannot be stated definitely who it was that changed his 
mind, the data would indicate a high enough reliability to warrant 
the assumption that the signed ratings are as good indicators of
estimations as it is possible to obtain.

By comparing the actual grades received and the estimated marks 
before the final examination, one finds a close relationship existing 
in all grades except D. Here, one finds the tendency to overestimate. 
Table II presents the data when the student actually knew seventy- 
five per cent of his grade. 

Table III presents the estimates after the student has completed 
the final examination. Although he does not actually know his 
mark on the examination, he should be in a better position to estimate 
his grade. Three students failed to estimate their grade at the final 
examination. One actually obtained a D; the other two received the 
grade of B. This explains the discrepancy in the totals of Tables 
II and III. 


TABLE III.—ESTIMATED GRADES AFTER FINAL EXAMINATIONS AND ACTUAL
GRADES

             Estimated grade after final examination
Grade          A       B       C       D
  A            7       2
  B            1      19       2
  C                    4      14
  D                    1       4       1

By computing the coefficient of contingency of the data in Table 
II it is found that the result is plus .66. The relationship of the data 
in Table III is found to be plus .74. When it is remembered that in a 
fourfold classification the contingency coefficient cannot exceed 
plus .86 then it must be noted that the relationship existing is high. 
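
The ceiling on the contingency coefficient mentioned above follows from its definition, C = √(χ² / (N + χ²)). For a table with k classes on each axis the largest attainable value is √((k − 1)/k), which for four classes is about .866. The sketch below illustrates this with a hypothetical table in which every estimate agrees with the actual grade; the table is invented for the illustration and the function names are not from the study.

```python
import math

def contingency_coefficient(table):
    """Pearson's coefficient of contingency, C = sqrt(chi2 / (N + chi2))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n + chi2))

def max_contingency(k: int) -> float:
    """Upper limit of C for a k-by-k classification."""
    return math.sqrt((k - 1) / k)

# Hypothetical 4 x 4 table with perfect agreement (all cases on the diagonal).
perfect_agreement = [
    [9, 0, 0, 0],
    [0, 24, 0, 0],
    [0, 0, 17, 0],
    [0, 0, 0, 7],
]

c_perfect = contingency_coefficient(perfect_agreement)
c_limit = max_contingency(4)
print(f"C for perfect agreement = {c_perfect:.3f}, limit = {c_limit:.3f}")
```

Coefficients of .66 and .74 are therefore to be judged against a ceiling of roughly .86 rather than against unity.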

Seventy-eight per cent of the students receiving the grade of A 
actually estimated their grade as A. Eighty-seven per cent of those 
receiving B made correct estimates. Seventy-eight per cent of 
those receiving a C were correct in their estimation. However, 
students receiving a D do not estimate their grade correctly but tend 
to overestimate. 

With reference to the problem whether the actual grade received 
bears any relation to the students’ rating of the instructor, one finds, 
as a result of an examination of Table IV, that no relation exists. 
Transcribing the letter ratings of the students into numerical terms, 
it is found that the average rating of the instructor’s ability is eighty-eight
per cent. The students who received an A rated the instructor
eighty-six per cent. The students receiving B rated him eighty-eight
per cent. The students obtaining the grade C rated him eighty-six
per cent, and finally, those receiving D rated the instructor eighty-nine
per cent. It can be seen, therefore, that the actual grade received,
which undoubtedly is a measure of the amount of material the student 


TABLE IV.—COMPARISON OF ACTUAL GRADES AND STUDENTS’ RATING OF
INSTRUCTOR

             Student rating of instructor
Grade          A       B       C
  A            2       6       1
  B            9      13       2
  C            2      14       1
  D            3       4       0

has learned as a result of completing the course, does not influence a 
student in his estimation of the instructor’s ability to teach the course. 
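
The numerical equivalents used in transcribing the letter ratings are not given in the text. The sketch below assumes A = 95, B = 85, and C = 75; this scale is an assumption, adopted here only because it reproduces the four group averages quoted above from the counts in Table IV.

```python
# Assumed numerical equivalents of the letter ratings (not stated in the text).
RATING_VALUE = {"A": 95, "B": 85, "C": 75}

# Table IV: number of students giving each rating, by actual grade received.
table_iv = {
    "A": {"A": 2, "B": 6, "C": 1},
    "B": {"A": 9, "B": 13, "C": 2},
    "C": {"A": 2, "B": 14, "C": 1},
    "D": {"A": 3, "B": 4, "C": 0},
}

def mean_rating(counts):
    """Average numerical rating for one group of students."""
    total = sum(counts.values())
    return sum(RATING_VALUE[letter] * n for letter, n in counts.items()) / total

group_means = {grade: round(mean_rating(counts)) for grade, counts in table_iv.items()}
for grade, value in group_means.items():
    print(f"students receiving {grade}: {value} per cent")
```

Whatever scale was actually used, the group averages differ by no more than three points, which is the basis for the conclusion that grade received does not influence the rating.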

A very similar condition exists when the relation between the 
estimated grade after the final examination and the rating of the 
instructor is examined. Table V presents the material. 


TABLE V.—COMPARISON OF ESTIMATED GRADES AFTER FINAL EXAMINATION AND
STUDENT-RATING OF INSTRUCTOR

Estimated    Student-rating of instructor
grade          A       B       C
  A            1       5       2
  B            9      15       2
  C            4      15
  D            1   [column of this entry not recoverable from the source copy]

It is found that eighty-four per cent is the average rating of the 
instructor by the students who estimate their ability as A. The 
students who estimate their grade as B or C, rate the instructor 
eighty-seven per cent. 


It can be seen, therefore, that the results of Table IV and Table 
V are statistically similar. 
CONCLUSIONS


1. If a statistical basis for grading is used, students can estimate
their grades correctly provided the grade is not lower than a C. In this
experiment 74.5 per cent of all students actually estimated their grade 
correctly. When the students obtaining the grade of D are eliminated, 
the percentage of correct estimation goes up to eighty with no student 
missing his grade by more than one letter grade. 

2. Students are not influenced by their actual standing or estimated 
standing in the course in rating the instructor on his ability to teach 
the course. Regardless of whether a group of students receives an 
A, B, C or D in the course the estimation of the instructor’s ability
remains essentially the same and closely resembles the average estimation
of the total group.





HETEROPHORIA AND READING DISABILITY 


PAUL A. WITTY AND DAVID KOPEL 


Northwestern University 


Because of the spatial arrangement of the eyes, a fixated object is 
always seen from a slightly different angle by each eye. This occurs 
even when the eyes are perfectly balanced; i.e., when corresponding
retinal points are stimulated simultaneously.¹⁴ Any object will
therefore produce slightly disparate images in each eye. Ordinarily 
the disparity is so small that the two images are readily combined or 
fused into one. Whipple¹⁴ states: “Under normal conditions, the
balance and the innervation of these (six pairs of extrinsic eye) muscles 
are such that both eyes move in concert, i.e., the eye movements are
automatically coördinated for purposes of single vision. . . . In
some individuals, however, there exists more or less ‘imbalance,’ or 
asymmetry of eye movement, so that the two eyes fail to ‘track,’ as it 
were.” 

The effect of eye muscle imbalance (heterophoria) upon the reading 
function is rather plausibly explained by Eames :* 

Exophoria is an anomaly of ocular coördination, consisting of a tendency of the
visual lines to deviate outward from parallelism when the eyes are at rest.* This 
tendency is carried over when the eyes are in use and results in imperfect conver- 
gence, fixation, and so forth. When the eyes deviate, the retinal images no longer 
fall on corresponding points of the two retinae, and the resulting mental picture 
is blurred and confused. Double vision may result and the visual image caused 
by the stimulation of one eye, may be imperfectly superimposed on that of the 
other. In the interest of clear, single vision, nature introduces an element of 
mental and muscular compensation which requires an excessive innervation 
resulting in fatigue. Fatigue varies with the degree of defect, the length of the 
period during which compensation is required, and the physical stamina of the 
individual. Although compensation can clear vision, fluctuation occurs, causing 
the eyes to waver, and this increases with fatigue. 


Perhaps the first group study of heterophoria in relation to reading 
disability was made by Eames® in 1932. His comparisons of one 





* Esophoria, the other and less frequent type of heterophoria, denotes a tendency
wherein the visual lines deviate inward. This condition produces results
similar to those of exophoria. 

† Dearborn,® in 1929, presented a paper which has apparently not been
published, describing “the operation of a second factor [sinistrality was the first], the
imbalance or insufficiency of the eye muscles (heterophoria) which even in degrees 
generally considered by ophthalmologists to be ‘within normal limits,’ appears to 
cause confusion in word recognition and reversal of letters.” 
hundred fourteen reading disability cases with one hundred forty-three
unselected children showed “a definite tendency of the reading
disability group toward a greater degree of exophoria” than was
found among the control group. The difference was “highly reliable 
statistically” and was “regarded as being very significant.” 

Selzer,¹¹ in 1933, claimed that ninety per cent of thirty-three
reading disability cases evinced eye-muscle imbalance, whereas only 
nine of one hundred unselected children showed this defect. Betts? 
reported similarly, in 1934, that ninety per cent (of an unstated 
number) of his “severely disabled readers” were “visually characterized
by faulty binocular coördination and astigmatism.”

As a result of his investigations, Betts* constructed the “Betts 
Ready to Read Tests,” a series of stereoscopic slides used in the 
Keystone Ophthalmic Telebinocular, “devised to appraise the coördinate
action of the eyes.” These tests, it is claimed, permit the
hitherto impossible “scientific study . . . of the binocular coördination
required in reading.”

The various eye conditions emphasized by Eames, Selzer, and 
Betts, and their relationship to poor and to effective reading were
investigated in the present study by means of the Betts tests and
apparatus.


SELECTION AND DESCRIPTION OF SUBJECTS 


The experimental group consisted of one hundred children of
IQ eighty or above whose reading scores upon the Metropolitan 
Achievement Tests were the lowest—one semester or more below their 
grade norms—among those of two thousand children in grades three 
to six inclusive of the Evanston Public Schools (District 75); the 
controls were normal readers of IQ eighty or above whose reading 
scores upon the Metropolitan Test were equivalent to or above their 
grade norms. 

Data concerning the two groups are presented in Table I. There 
are sixty-six boys and thirty-four girls in the problem group. Com- 
parable numbers of boys and girls are in the control group, and the 
two groups contain proportional numbers of children from the same 
grades and schools. The average IQ of the problem children is 
ninety-six; of the control, one hundred four. The average chrono- 
logical age of the problem children is ten years, four months; of the 
control, nine years, two months. Age-grade distributions show the 
problem children to be retarded one semester in 3B; the retardation 
increases to one grade in 5B. On the average, the problem group is
one year older than the non-problem. 

Reading retardation of .67 grade in 3B increases to 1.1 in 5B, with 
the average retardation about one grade for the problem children. 
For the non-problem children, the average acceleration in reading is 
about one-half grade. Thus, the disparity between the groups is 
approximately one and one-half grades. (These averages from 
the Metropolitan tests were corroborated by the Gates Silent Reading 


TABLE I.—DESCRIPTION OF PROBLEMS AND NON-PROBLEMS

                                                        Average      Average
                          Grade     Average  Average    reading      reading
                   N    placement     age      IQ     achievement  achievement
                                                           I            II
       1           2        3          4        5          6            7
Problem          100      3B-6B      10-4       96       -.96         -9.9
Non-problem       80      3B-6B       9-2      104       +.39         +1.1

1. Problems: Children one semester or more below their grade-norm in reading,
on the Metropolitan Achievement Test.

Non-problems: Controls, matched for school, grade, sex; reading ability
equivalent to or above grade-norms.

4. Age range—problems: 7-4 to 14-7; non-problems: 7-4 to 12-4.

5. Kuhlmann-Anderson Intelligence Tests. Range for problems: 80-116; for
non-problems 83-123.

6. Figures, in grades, represent grade placement minus grade equivalent in
reading (October, 1934).

7. Figures, in months, represent mental age minus reading age (October, 1934).

Tests.) The reading achievement of the non-problems is in general 
consonant with mental age, although their achievement is slightly 
below mental age in grades three and four, and slightly above in 
grades five and six. 


METHOD OF STUDY 


In attempting to ascertain the relationship of each ocular item 
to reading ability, the writers found it necessary to define degrees of 
reading ability as well as deviations in reading skill. The reader will 
recall that reading achievement was reported above (in Table I) 
by comparing reading grade equivalent with grade placement. All 
problem cases—by selection—were below their grade-norms. The 
average retardation of the problems and the average acceleration 
of the non-problems were computed by grade; thus, the average 
deviation of each grade was established. Deviations above (plus) 
and below (minus) the grade averages were then computed for each 
case, thus eliminating the effect of grade differences in level of reading 
ability, and making achievement comparable for each member of the 
group. 
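
The procedure just described can be sketched in a few lines. This is only an illustration: the function name, the grade labels, and the four sample cases are hypothetical and are not data from the study. Each achievement figure is taken to be reading grade equivalent minus grade placement, as in Table I.

```python
from collections import defaultdict
from statistics import mean

def deviations_from_grade_average(cases):
    """Return each case's deviation from the average of its own grade.

    cases: list of (grade, achievement) pairs, where achievement is the
    reading grade equivalent minus grade placement (hypothetical input).
    """
    by_grade = defaultdict(list)
    for grade, achievement in cases:
        by_grade[grade].append(achievement)
    # Average deviation of each grade, computed separately.
    grade_average = {g: mean(values) for g, values in by_grade.items()}
    # Plus means above the grade average, minus means below it.
    return [round(a - grade_average[g], 4) for g, a in cases]

# Hypothetical cases: two children in 3B and two in 5B.
cases = [("3B", -0.5), ("3B", -1.0), ("5B", -1.0), ("5B", -1.5)]
print(deviations_from_grade_average(cases))
```

Because every case is measured against its own grade average, a child half a grade behind in 3B and one a full grade behind in 5B can receive the same deviation, which is what makes the cases comparable across grades.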

The visual classification in Table II corresponds roughly to the 
functions measured by the Betts Tests of Visual Sensation and 
Perception.* The probable effect of visual status upon reading 
achievement was studied by comparing the average attainment of 
the problem and non-problem children in each visual function. 


TABLE II.—VISION RELATED TO READING ACHIEVEMENT 

                                                  Problems,                   Non-problems,               Percentages
                                                  reading achievement         reading achievement         of N
Categories of vision                                 N    Devi-     Average      N    Devi-     Average      P    NP
                                                          ations*   devi-             ations*   devi-
                                                                    ations*                     ations*
Slow fusion:
  1a. At far distance only                          15    −2.057    −.1371       1    −.3083    −.3083
  1b. At near distance only                         12     2.705     .2254
  1c. At far and near distance                       2      .165     .0825
  1. All slow fusion                                29      .813     .0280       1    −.3083    −.3083     29     1
2. No fusion                                        18     2.062     .1145      15    2.8101     .1873     18    20
3. Lateral muscle imbalance (with fusion)            8      .140     .0175      12   −1.0556    −.0880      8    15
4 (1, 2, 3). All fusion and muscle balance
     difficulties                                   55     3.015     .0548      28    1.4462     .0517     55    35
4a. All fusion and muscle balance difficulties
     with deficient acuity and ametropia            25     2.435      .097      14     1.176      .084     25    18
5. Deficient acuity and ametropia (only)†            3    −1.961    −.6537       5     .4224     .0845      3     6
6. Deficient acuity (only)†                          3   [illegible]                                        3
7. Ametropia (only)                                 15      .157     .0105      15   −2.1506    −.1434     15    19
8 (5, 6, 7). All: Deficient acuity and ametropia    21    −2.600    −.1238      20   −1.7282    −.0864     21    25
9 (4, 8). All visual difficulties                   76      .415     .0055      48    −.2820    −.0059     76    60
10. No visual difficulties                          24     −.432     −.018      32     .2954     .0092     24    40

* Figures are plus when not preceded by minus signs. 
† “Deficient acuity” is an error of refraction and is subsumed logically under “ametropia.” 
It is listed separately, however, to follow the classificatory order in the Betts records and materials. 

The eye condition and reading attainment of the problem and of 
the non-problem group will be discussed first; then comparisons 
of the two groups will be set forth. 

FUSION DIFFICULTIES, LATERAL IMBALANCE, AND READING 
ATTAINMENT 


Problems.—In the slow fusion group (problems), those with slow 
fusion at far distance only (group 1a) have an average deviation of 
−.14 grade. Those suffering with slow fusion at near, or reading 
distance (group 1b)—a condition ostensibly prejudicial to good reading 
—have an average plus deviation of .23 grade. The two cases with 
slow fusion at both far and near distance (group 1c) are above the 
average (of the problem class). The combined cases (group 1) also are 
slightly above the average (+.03 grade) in their reading achievement. 

No fusion, a condition asserted by Selzer, Eames, and Betts 
to be a cause of reading disability, failed apparently to produce its 
alleged effect—group 2 is .11 grade above the average. 

Combining all children who have fusion and muscle difficulties in 
group 4, which contains fifty-five per cent of the cases, and considering 
their educational status, one obtains for the group a slight positive 
deviation of .05 grade. 
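
The composite figure for group 4 can be checked from the component rows of Table II. The sketch below assumes that a composite is formed by adding the component Ns and the component sums of deviations and then dividing; the dictionary keys and the function name are illustrative only.

```python
# (N, sum of deviations) for the problem children in groups 1, 2, and 3,
# as read from Table II.
components = {
    "1 slow fusion": (29, 0.813),
    "2 no fusion": (18, 2.062),
    "3 lateral muscle imbalance": (8, 0.140),
}

def pooled_average(parts):
    """Combine (N, sum of deviations) pairs into N, total, and average."""
    n = sum(count for count, _ in parts)
    total = sum(deviation_sum for _, deviation_sum in parts)
    return n, total, total / n

n, total, average = pooled_average(components.values())
print(n, round(total, 3), round(average, 4))
```

Under that assumption the result agrees with the group 4 row for the problem children (N of 55, sum of 3.015, average of .0548), which rounds to the .05 grade quoted in the text.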

Included in group 4 are a considerable number of cases suffering 
not only from fusional and muscle balance difficulties, but also from 
deficient acuity and ametropia (group 4a). In the problem group 
twenty-five per cent (and in the non-problem class eighteen per cent) 
of the cases are doubly “afflicted.” The average deviation, however, 
is + .10 grade. 

Group 3, with lateral muscle imbalance—which is said by Selzer, 
Eames, and Crider to cause lack of fusion—is .02 grade above the 
average.* Fendrick similarly found that lateral muscle imbalance 
does not contribute to reading disability, thus corroborating this 
finding.† The work of Farris further supports this view. 





* This group includes one case of vertical imbalance; the vertical balance of 
two children could not be tested because of serious refractive defects in one eye. 

† Fendrick has recently reported a thorough study of the “Visual Character- 
istics of Poor Readers.” Sixty-four pairs of poor and good readers were tested 
with a modified Snellen chart, the Jaeger chart, the Betts apparatus, and were 
examined by optometrists (in the Division of Optometry, Columbia University). 
The results of all tests revealed the same general trend: A lack of relationship 
between ocular aberrations and reading disability (except when teaching methods 
rely preponderantly upon visual techniques). . . . “Measures of lateral eye- 
muscle coördination did not yield any evidence that reading disability cases 
manifested a more pronounced aberrance in muscle-imbalance than the control 
cases. The reliability of this finding was established through three distinct 
approaches which consistently failed to produce any significant variation in the 
group comparisons.” 

Three cases of deficient acuity and ametropia (group 5) and three of 
deficient acuity only (group 6) show rather marked negative devia- 
tions. When these cases are combined with group 7 (ametropia only) 
—where there is a very slight positive deviation—the result is a 
negative deviation of .12 grade (group 8). Although this is one of 
the largest deviations found, it is so small that it can scarcely be said 
to indicate the presence of a general factor contributing to poor reading. 
Monroe has corroborated this: “Lack of adequate visual acuity was 
not found to be a highly frequent [contributing] cause [in reading 
disability cases] . . . and did not distinguish the reading-defect groups 
from other groups of children who did not have reading difficulties.” 

All cases having visual difficulties (group 9) yield an average 
deviation of +.006 grade; all subjects exhibiting no visual difficulties 
(group 10) show an average deviation of −.02 grade. In none of the 
main groupings, then, does one find any significant reading deviation 
associated with a visual aberration or difficulty; in none is the average 
deviation more than a few hundredths of a grade above or below the 
average of the whole group. 

Non-problems.—The one case of slow fusion (at far distance only) 
deviates minus .31 grade. The no-fusion group deviates .19 grade 
above the average. (This approximates the +.11 of the problem 
group.) The lateral muscle imbalance group is .09 below the average. 
(This corresponds closely to the +.02 grade deviation of the problem 
group.) The composite of these groups (group 4) is .05 grade above 
the average. Group 4a, containing cases with various combinations 
of every visual defect, resembles the problem group: A positive 
deviation of .08 grade corresponds closely to the problem group’s 
average of +.10 grade. 

Group 8—all cases of deficient acuity and ametropia—has an 
average deviation of −.09 (similar to the −.12 for the problem 
children). 

Although the problem and non-problem group deviations reported 
above are similar in quantity (and direction), the difference in numbers 
in the problem and non-problem groups, and the minuteness of the 
deviations, have effected in the combinations in groups 9 and 10— 
all cases with and without visual difficulties, respectively—a reversal 
of plus and minus signs. The quantities, however, remain distinctly 
similar (and minute). Thus, in non-problem group 9, the average 
deviation is −.006; in the problem group, it is +.006. In non- 
problem group 10, it is +.009; in the problem group, −.018. 

Thus again one finds no significant deviations. Differences in 
reading associated with various visual conditions mensurable only in 
hundredths and thousandths of a grade can not be deemed of importance. 

Comparison of Problems and Non-problems.—Certain differences 
between the problem and non-problem children merit consideration. 
Striking is the fact that twenty-nine per cent of the former and only one 
per cent of the latter exhibit slow fusion. The reason for the disparity 
is not known. That this factor is important etiologically (in the case 
of the poor readers) is doubtful, in light of the data reported herein. 

The size advantage of the problem children in group 4—fifty-five 
per cent in comparison with thirty-five per cent in the non-problem 
group—is due, of course, to the heavy weighting of the slow fusion 
group. 

Group 8 contains cases of deficient acuity and ametropia, uncompli- 
cated by other difficulties. Adding to this category the cases included 
in group 4a, a percentage of forty-six is obtained for the problem class, 
and forty-three for the non-problem class. The similarity is note- 
worthy. Although the larger percentage of problem cases in group 
4a may appear suggestive of a significant difference, a detailed analysis 
of “serious” cases in both problem and non-problem groups (not 
shown in Table II) reveals that fully half of these cases display 
reading deviations above the average. 

The similarity in percentages of cases with deficient acuity and 
ametropia (twenty-one and twenty-five, and forty-six and forty-three, 
for problems and non-problems, respectively) has been noted. In 
group 9 (all visual difficulties) the difference between the seventy-six 
per cent for the problems and sixty per cent for the non-problems is 
attributable largely to the preponderance of slow fusion cases in the 
former classification. 

Analysis of the data leads to the conclusion that the poor readers 
are not characterized by a greater incidence of visual defects and 
anomalies than are good readers. With the exception of the slow 
fusion group the percentages among the non-problem are somewhat 
higher than among the problem children in practically all other visual 
(defect) categories studied. Furthermore, the various visual factors— 
slow fusion, no fusion, lateral muscle imbalance, deficient acuity, and 
ametropia, singly and in combination—appear unrelated to reading 
deficiency. 

Are visual defects, then, unassociated with reading achievement? 
The answer, predicated upon the group analyses above, is not an 
unequivocal “yes.” True it is, that the visual defects studied do 
not appear to cause or to contribute to reading disability. Neverthe- 
less, before asserting that visual defects are not etiologically con- 
cerned in poor reading, one should recognize this alternative: Visual 
defects may impede the reading progress both of poor and of good 
readers. In other words, it is suggested that although the relative 
incidence of various visual defects among poor readers does not 
distinguish them from good readers, visual defects are not necessarily 
unrelated to reading achievement: Correction of defects may improve 
the reading ability of poor readers and of those who exhibit no reading 
difficulties. Studies of individual cases with marked visual defects 
which have been corrected or ameliorated—and followed by appropri- 
ate remedial procedures—generally show definite improvement in 
reading attainments. 

In support of the hypothesis advanced above another point may be 
made. Although large percentages of visual defects are found in 
good as well as in poor readers, it should be remembered that indi- 
viduals vary greatly in their capacity to make successful adaptations 
and to compensate for defects. Not only in psychoanalytic literature 
does one find instances of adequate and acceptable compensation 
for somatic and psychic inferiorities. Ophthalmologists are fairly 
well agreed that fusion, for example, is dependent upon more than— 
and can occur without—ocular muscle balance. Essential is “a 
mental desire for a single image.” This is not, however, a suggestion 
that differentiation between poor and good readers should be made 
on the basis of the effectiveness of an hypothesized compensatory 
function (if a satisfactory criterion could be found for this function). 

One must guard against over-simplification. To dismiss sum- 
marily further consideration of the effect of the visual-perceptual 
act upon efficient reading performance is to overlook the complexity 
of the process, its many-sided relationships to general bodily and 
mental health. One must bear in mind, too, not only the somewhat 
doubtful reliability and validity* of the Betts tests (and of all eye- 
muscle tests) but also the lack of adequate standardization of many 
items contained in the Betts battery. 





* Insofar as these muscle-imbalance tests purport to measure a function sig- 
nificant in the reading process. 

In conclusion, it is clear that the cause of reading disability (as an 
entity) lies in no single visual factor. Every visual (defect) item 
considered seems to play a relatively negligible rôle in the attainment 
of poor and good readers. Nevertheless, normal vision is indubitably 
essential to maximum attainment. Therefore it is highly desirable 
that each child, upon entrance to school and at regular periods there- 
after, should receive thorough ophthalmological study. When this 
attention is not available, the teacher, nurse, or school psychologist 
will find apparatus such as the Keystone Ophthalmic Telebinocular 
helpful in isolating quickly many serious visual defects. In every 
case of reading disability search should be made for visual difficulties. 
Such examination is a vital item in the comprehensive individual 
diagnosis which should precede remedial endeavor. 


FACTOR ANALYSIS OF SOCIAL AND ABSTRACT 
INTELLIGENCE 


ROBERT L. THORNDIKE 


George Washington University 


It was suggested by E. L. Thorndike some twenty years ago that 
there might be three main types of intelligence—abstract, mechanical, 
and social. Since that time a great number of tests have been devel- 
oped to measure abstract intelligence, and smaller numbers to measure 
mechanical intelligence and social intelligence. One of the better 
known of the tests purporting to measure social intelligence is the 
George Washington Social Intelligence Test. This test has been 
criticized because of its low correlation with other tests which were 
also supposed to measure social intelligence (i.e., Gilliland’s Socia- 
bility Test), and because of high correlations obtained between it and 
tests of abstract intelligence. 


TABLE I.—INTERCORRELATIONS OF SUB-TESTS 

         2     3     4     5     6     7     8     9    10
  1     530   577   265   481   394   466   581   301   401
  2           414   317   287   284   294   322   143   275
  3                 315   375   315   354   399   307   313
  4                       326   196   143   178   077   176
  5                             337   345   423   286   477
  6                                   379   411   196   290
  7                                         450   233   313
  8                                               219   369
  9                                                     304
 10

1 The intercorrelations for the group of five hundred were computed by Mr. 
Saul Stein and appear in an unpublished MA thesis in the library of George Wash- 
ington University. I wish to thank him for letting me use them. 

The purpose of the present investigation is to determine whether 
this Social Intelligence Test measures any unitary trait which is 
distinct from the ability measured by an abstract intelligence test. 
This problem will be approached through a factor analysis of the 
sub-tests of this test and of one of the standard abstract intelligence 
tests (the George Washington Mental Alertness Test). The correla- 
tional matrix of the ten sub-tests from these two tests will be analysed 
by Thurstone’s simplified method of factor analysis, in an effort to 
determine the fundamental factors running through these two tests. 

These two tests have been given regularly to new students enter- 
ing George Washington University. The results reported here are 
obtained from the scores of a group of five hundred students taken at 
random from those entering in 1932 and 1933 and a group of two hun- 
dred fifty students taken from those entering in 1934. The correla- 
tions had been computed for these two groups separately.¹ The 
results from the two groups were combined, giving weight in inverse 
proportion to the variance, under the assumption that the true 
correlation was the same in both cases. This amounted to giving 
the correlations from the group of five hundred twice the weight of 
those from the group of two hundred fifty. In most cases, the cor- 
relations were quite similar for the two groups, and it is not thought 
that any important error was introduced by the method of combining. 
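
The weighting described above can be illustrated with a short sketch. It assumes that, when the true correlation is the same in both groups, the sampling variance of a correlation is roughly inversely proportional to the number of cases, so that inverse-variance weights reduce to weights proportional to N. The function name and the two sample correlations are hypothetical, not values from Table I.

```python
def combine_correlations(r_a, n_a, r_b, n_b):
    """Weighted mean of two sample correlations.

    Weights are proportional to group size, which stands in for weighting
    in inverse proportion to the variance (an assumption of this sketch).
    """
    return (n_a * r_a + n_b * r_b) / (n_a + n_b)

# Hypothetical correlations from the groups of 500 and 250 students.
combined = combine_correlations(0.54, 500, 0.51, 250)
print(round(combined, 3))
```

With groups of five hundred and two hundred fifty, the first correlation receives exactly twice the weight of the second, as the text states.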

The intercorrelations are given in Table I. 

The ten variables studied were the following: 


Mental alertness test. 

     1 ........ Vocabulary 
     2 ........ General information 
     3 ........ Learning ability 
     4 ........ Arithmetical reasoning 
     5 ........ Comprehension 

Social intelligence test. 

     6 ........ Judgement in social situations 
     7 ........ Recognition of mental state 
     8 ........ Observation of human behavior 
     9 ........ Memory for names and faces 
    10 ........ Sense of humor 


A factor pattern of three factors was fitted to these correlations. 
The three factors reduced the residual correlation to approximately 
what would have been expected by chance. There is some doubt as 
to whether the third factor was necessary. The factor loadings for 
each variable are given in Table II. 
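
As an illustration of what a residual correlation is, the sketch below reproduces one correlation from the factor loadings and subtracts it from the observed value. It assumes uncorrelated factors, so that the reproduced correlation is the sum of the products of the two variables' loadings; this is the usual reading of such a factor pattern but is not spelled out in the text. The loadings for variables 1 and 3 are taken from Table II and the observed correlation of .577 from Table I; the function name is illustrative.

```python
def reproduced_correlation(loadings_i, loadings_j):
    """Correlation implied by the loadings of two variables,
    assuming the factors are uncorrelated."""
    return sum(a * b for a, b in zip(loadings_i, loadings_j))

# Loadings on factors 1, 2, 3 for Vocabulary (1) and Learning ability (3).
vocabulary = (0.781, 0.009, 0.184)
learning_ability = (0.673, 0.141, 0.165)

observed = 0.577  # r between variables 1 and 3 in Table I
reproduced = reproduced_correlation(vocabulary, learning_ability)
residual = observed - reproduced
print(round(reproduced, 3), round(residual, 3))
```

For this pair the three factors reproduce about .557 of the observed .577, leaving a residual of roughly .02.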

We see that the first factor is overwhelmingly the most important. 
The first factor is weighted positively in every test, and corresponds 
roughly to what is general to all ten tests. It accounts for about 
nine times as much of the covariance as does the second factor. The 
second factor has predominantly positive weights for the Mental 
Alertness Test and negative for the Social Intelligence Test, though 
most of the weights are small. The third factor is of even less impor- 
tance, and discriminates the last subtest of each test from the others. 
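
The relative importance of the three factors can be checked from the loadings in Table II. The sketch below assumes that importance is gauged by the mean of the squared loadings on each factor; that reading is an inference from the figures in the bottom row of the table (.357, .040, .035) and is not stated in the text. The function name is illustrative.

```python
# Loadings of the ten variables on factors 1, 2, 3, as given in Table II.
loadings = [
    (0.781, 0.009, 0.184),
    (0.579, 0.206, 0.151),
    (0.673, 0.141, 0.165),
    (0.396, 0.418, -0.004),
    (0.651, -0.005, -0.253),
    (0.548, -0.140, 0.053),
    (0.587, -0.251, 0.115),
    (0.671, -0.243, 0.070),
    (0.405, -0.111, -0.076),
    (0.579, -0.076, -0.420),
]

def mean_squared_loadings(rows):
    """Mean of the squared loadings on each factor, one value per factor."""
    n = len(rows)
    return [sum(row[k] ** 2 for row in rows) / n for k in range(len(rows[0]))]

msl = mean_squared_loadings(loadings)
print([round(value, 3) for value in msl])
print(round(msl[0] / msl[1], 1))  # ratio of the first factor to the second
```

On this reading the first factor comes out at about nine times the second, in line with the statement above, and the second and third factors are of nearly equal and small size.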

These results suggest that insofar as the parts of either of these 
tests measure a general trait of the individual, it is the same one that 
is measured by the other test. The size of the first factor loadings 
suggests that comprehension and use of words account for most of 
what is measured both by the Mental Alertness Test and by the Social 
Intelligence Test. There is evidence in the second factor that the 
parts of the Social Intelligence Test do have a little in common that 
they do not share with the abstract test. The third factor seems to 
be a speed factor. 


TABLE II.—FACTOR LOADINGS 

                                 Factor
Variable                    1        2        3
    1                     .781     .009     .184
    2                     .579     .206     .151
    3                     .673     .141     .165
    4                     .396     .418    −.004
    5                     .651    −.005    −.253
    6                     .548    −.140     .053
    7                     .587    −.251     .115
    8                     .671    −.243     .070
    9                     .405    −.111    −.076
   10                     .579    −.076    −.420
Mean squared loading      .357     .040     .035

Our conclusion is, then, that though the George Washington 
Social Intelligence Test may tap slightly some unique field of ability, 
it measures primarily the ability to understand and work with words 
which bulks so large in an abstract intelligence test. 


BOOK REVIEWS 


HELEN M. WALKER, Editor. The Measurement of Teaching Efficiency. 
New York: The Macmillan Co., 1935, pp. XXIV + 237. 


In 1930 Kappa Delta Pi set aside one thousand dollars as an award 
for the best report of research during the following biennium on a 
subject to be assigned by the Executive Council and as recommended 
by its Committee on the Research Award. The subject assigned for 
the first competition was “The Measurement of Efficiency in Teach- 
ing.” Out of twenty-three competing studies none was found to be of 
such pre-eminent value as to merit the reward, but three of them were 
deemed sufficiently significant to be recommended for publication. 
Under the editorship of Dr. Walker they have been prepared for the 
press. 

Having spent last year, largely to no purpose, in trying to discover 
a means of selecting prospective teachers, the reviewer read these three 
studies avidly. The third study by Dr. G. L. Betts, “Pupil Achieve- 
ment and the NS Trait (novice-superior) in Teachers,” is introduced, 
I suppose, as a warning to would-be researchers. It is full of beautiful 
statistical techniques, but no statistical juggling will produce reliable 
findings from essentially poor (one almost said ridiculous) tests. 
Naming a trait or a test doesn’t make it one. Marking the tests on the 
eenie, meenie, minee, mo principle would have given just about the 
same results as were found in this study. No wonder the Editor had 
qualms of conscience about printing it. 

The other two studies are excellent ones, especially the first— 
W. H. Lancelot’s “Study of Teaching Efficiency as Indicated by Certain 
Permanent Outcomes.” A new technique was introduced and the 
report upon it is classical in its logical directness and simplicity of 
statement. Engineering students in the Department of Mathematics 
of Iowa State College take six courses in Mathematics in a prescribed 
sequence. The device used by Lancelot was to judge the teachers by 
two standards: (1) The quality of the work done by students in all 
later courses of the sequence; and (2) the persistence of students in 
continuing to the end of the sequence. The idea is simple and sound. 
The pupils of the better teachers should do better work in subsequent 
mathematical courses and should be stimulated to persist to the end of 
their training. Space forbids a detailed description of the study, but 
it is so good that it should be made a required reading in all courses in 
Educational Research. I would certainly have awarded it the prize. 

The second study by Messrs. Barr, Torgerson, Johnson, Lyon and 
Walvoord, entitled “The Validity of Certain Instruments Employed 
in the Measurement of Teaching Ability,” is a careful evaluation of 
nineteen paper tests and scales in common use. These included 
intelligence tests, teacher rating scales, professional information tests, 
personality and vocational interest tests and so forth. Many criteria 
were used, four of them being composites of the tests and scales used, 
and the fifth being gain in pupil achievement. The last should work, 
but unfortunately it doesn’t. The best device was the psychological 
examination. Very high in the list were the various rating scales, 
which shouldn’t work, being subjective ratings, but apparently do. 
This study does not settle the question at issue but throws some light 
upon it, and gives some idea of the difficulty which attends its solution. 

This review is much too long. The reviewer apologizes for it, but 
pleads the importance and urgency of the problem with which the 
workers have wrestled. PETER SANDIFORD. 

University of Toronto. 


G. D. STODDARD AND B. L. WELLMAN. Child Psychology. New York: 
The Macmillan Co., 1934, pp. XII + 419. 


As the authors state in the preface, “this book represents an 
attempt to base a psychology of the child directly upon the outcomes 
of research . . . It is designed for students and other workers whose 
background in psychology permits them to explore one of its special 
branches.” For this reason the book is better suited as a text for 
intermediate or advanced courses in child psychology than for an 
elementary course. The authors’ rigid adherence to the factual 
outcomes of research is admirable and refreshing. The bibliography 
contains four hundred ninety-three titles nearly all of recent date. 
Paralleling the available research literature, the emphasis is on infancy 
and the preschool period. Prenatal life and adolescence are not 
covered. 

The style is rather uneven, some sections being much more readable 
than others. Some of the chapters are compendia of facts and figures 
more in the manner of reviews or reference material with little of the 
setting being provided for the reader who is unfamiliar with the original 
researches. The discussions of certain groups of researches, however, 
are much more than mere summaries, many of them being highly 
critical, and attempts are made to harmonize conflicting results, the 
authors’ point of view on controversial issues being clearly stated. 

The organization is topical with particular emphasis on methods of 
research, intelligence, language and learning. Because of this topical 
organization there is a certain artificiality in the presentation which 
does not readily reveal developmental sequences, or show the child as a 
whole. This, of course, is probably the necessary result of a book 
planned with the ideal of rigid adherence to the results of research that 
are necessarily concerned with isolated aspects of behavior at specific 
age levels. 

The chapter on learning is especially controversial, the one on 
methods well organized and thorough; the one on “the meaning of 
intelligence” emphasizes the authors’ point of view that optimal 
intellectual development requires appropriate environmental stimula- 
tion and that intelligence is improved by environmental opportunities. 
Only a seven-page chapter is devoted to the emotions. The chapter on 
artistic capacity is concerned almost entirely with musical capacity 
as measured by the Seashore Measures of Musical Talent. The fourth 
part on personality and adjustment pulls together in a rather satis- 
factory manner the meagre factual material available on this aspect of 
child psychology which has been neglected in most recent texts. 

On the whole the book is a valuable addition to the rapidly grow- 
ing number of child psychology texts and, in some respects, it holds a 
rather unique place among them. DOROTHEA MCCARTHY. 

Fordham University. 


EDITH J. VARON. The Development of Alfred Binet’s Psychology. 
Psychol. Rev. Mon. 207; Vol. XLVI, 3. Princeton: Psychological 
Review Company, 1935, pp. 129. 


Of all the names of psychologists Binet’s is probably the best known 
to the general public. Few there are who have not heard of the Binet 
tests of intelligence. It was therefore with pleasurable anticipation 
that I began to read this work. I must confess to some disappoint- 
ment. Not that the job has been done badly. But the emphases have 
been placed on minor rather than on major elements. Binet to psy- 
chology stands for a man who was fertile in devising new experiments 
and using simple means to solve important problems. His genius 
along these lines was nowhere better shown than in his intelligence 
tests. If Miss Varon had shown step by step how these tests came into 
being, and what borrowings Binet made from the work of others, she 
would have added interest to her account. Nowhere does she supply 
detailed specific information about a single field, although the indus- 
trious reader can get it by reference to an excellent bibliography. The 
work, though competent, is rather dull. The genius of Binet does not 
shine through it. PETER SANDIFORD. 

University of Toronto. 


G. W. BAEHNE, Editor. Practical Applications of the Punched Card 
Method in Colleges and Universities. New York: Columbia 
University Press, 1935, pp. XXII + 422. 


The purpose of this book is to show the possibilities of Hollerith 
Tabulating Machines for the analysis of various statistical problems 
arising in the several departments of a university. The thirty-nine 
chapters in the book deal chiefly with the development and principles 
of the punched-card method, applications in the registrar’s office, 
in university business offices and miscellaneous administrative applica- 
tions; applications in psychological, educational, medical, legal, 
agricultural and other types of research; and in the solution of statisti- 
cal problems. Each chapter within a section has been prepared by an 
authority who has had considerable experience in adapting the 
machines to the problems in his field. While this has the advantage of 
making each chapter authoritative, there is considerable repetition of 
material and the book varies widely from chapter to chapter as to 
style, merit, and specificity. 

The fifth part of the book, “Applications in Psychology and 
Educational Research,” is of particular interest to readers of this 
Journal. Dr. H. C. Toops’ chapter, “Questionnaire Construction and 
Analysis,” is both the most extensive and the most intensive chapter 
in the section. Detailed information is presented on the construction, 
pre-coding and analysis of questionnaires. In addition to this use 
of the equipment Toops lists some twenty other fields in which he has 
used the machines. Toops’ evaluation of the method may be seen 
in the following quotation: “The right arm and brain of any research 
man can be multiplied by at least ten by the judicious use of Hollerith 
equipment. Our estimate is that the effectiveness of most psychologi- 
cal testing and personnel research can be multiplied by about fifteen 
by this means!” 

Chapter two, “Analysis of College Test Results,” and Chapter 
three, “Analysis of High School Test Results,” by Dr. Ben D. Wood 
and Dr. E. F. Lindquist, respectively, are clear, concise explanations 
of the use of punched-card methods for analyzing test results. Lind- 
quist indicates the use of the method for item analysis. It is regrettable 
that further emphasis was not placed on this treatment in either of 
these chapters in view of the experience of the authors in item analysis. 

The fourth chapter, “The Scoring of Vocational Interest Tests,” 
by E. K. Strong, Jr., briefly describes two methods for scoring multiple 
response tests where each response receives a variety of weights. 
The first method is the one originally described by Rulon and Arden in 
1930 for use with standard equipment, while the second method 
requires a special attachment but is superior in terms of speed. 

Dr. L. M. Terman and Dr. Maude A. Merrill in Chapter five, 
“Analysis of Intelligence Test Scores,” give a very brief discussion of 
the adaptation of the equipment for item analysis using as an example 
the data for the revision of the Stanford-Binet test. 

The final chapter in this section is by Dr. T. L. Kelley and describes 
the preparation of a key for a free association test together with 
methods of scoring such a test for eight traits. 

The last section of the book, “Applications of Specific Methods 
to the Solution of Statistical Problems,” is of interest particularly to 
statisticians. The first chapter in this section by Dr. H. C. Carver 
lays special emphasis on the use of the automatic multiplying punch 
in the computation of moments, product moments, distributions and 
correlation tables. The procedure is easily followed. 

The other chapter in this section, “Uses of the Progressive Digit
Method,” by Dr. A. E. Brandt, treats the analysis of variance, 
covariance, curve fitting, difference tables and multiple correlation. 
This is a very practical and useful chapter if read in conjunction with 
the references cited. 

Perhaps the most valuable contribution of this book is to supply 
the research worker and statistician with detailed information as to the 
wide applicability of the apparatus. This material may help him to 
convince other departments of the advantages of such equipment and 
by coöperative action, perhaps secure the equipment for the institu-
tion, thus making these powerful, but expensive, machines available
for his own investigations. Jack W. Dunlap.

Fordham University. 


A. LAWRENCE LOWELL: At War with Academic Traditions in America. 
Cambridge, Mass.: Harvard University Press, 1934, pp. XIV + 
357. 


“Surely the essence of a liberal education consists in an attitude of
mind, a familiarity with methods of thought, an ability to use informa-
tion rather than in a memory stocked with facts, however valuable such
a storehouse may be” (p. 36). With this sentence President Lowell in
1909 not only set the tone of his inaugural address; he likewise struck
what proved to be the central tendency of his administrative thought 
and action during his years of leadership at Harvard. A reading of 
the addresses and writings brought together in this volume makes it 
clear that this “war with academic tradition,” no matter on what
front the action took place, had always the purpose of lifting the 
individual intelligence above the patterns arranged by the traditions of 
our colleges and universities. 

The reader will find here, in the essays and addresses reprinted, the 
vigorous expression of ideas calculated to cut through tradition to 
the heart of the educational task. And it matters little whether the 
occasion for expression is participation in inaugural exercises at 
another institution or a discussion of the function of examinations. 
Always, and appropriately, the purpose of higher education is made a 
back-drop against which the ideas are presented. The reader will 
also find, as he follows through selected sections of President Lowell’s 
annual reports, the practical effects, at Harvard, of a theory in action. 
The volume, therefore, much as it is a record of the impact of one 
individual upon a single institution, is more than this. It is the record 
of an educational faith. 

A topical index makes the book especially valuable for those who 
wish to check, guide or stimulate their thought about differing problems 
within the field of higher education today. But those who so use the 
book are almost certain to feel the urge to explore it further. No 
matter where one turns, a sense of watching institutional practice
undergo reorganization is conveyed, and from any point therefore one 
may move backward or forward in the story with no break in con- 
tinuity. It is, in short, a highly satisfactory volume. 

Of particular interest, at least so it appears to this reviewer, is the 
recurring reference to the place of the college in our scheme of higher 
education. In this day of the junior college, and the accompanying 
upper and lower divisions within institutions of full collegiate grade, 
the time is ripe for a reconsideration of purpose and function for each of 
the units in higher education. The desire for academic respectability 
as each change occurs may easily fasten tradition the more firmly upon 
us. We have looked, for instance, upon the graduate school influence 
within the college as one to be welcomed. President Lowell saw this 
differently. “Graduate schools have had one effect that has not been
sufficiently noticed. They have tended to block the advance of the
college” (p. 212). A wide-spread recognition of this fact would
give new vitality to the somewhat halting reorganization taking place 
in our colleges. In any case, to follow through President Lowell’s 
effort to remove this block is to become familiar with changes instituted 
at Harvard designed to bring new character both to undergraduate and 
to graduate work. H. Gordon Hullfish.
Ohio State University. 


JAMES J. WALSH: Education of the Founding Fathers of the Republic.
New York: Fordham University Press, 1935, pp. XII + 377. 


Ordinary sources of information fail to reveal the dominant rôle
played by scholasticism in early American education. By an analysis 
of theses distributed at commencement in Colonial colleges, Walsh 
has shown that higher education was dominantly scholastic until after 
1800. The success of this type of college training is demonstrated by 
the fact that a large proportion of the signers of the Declaration of 
Independence, and other important leaders in early American life 
received such an education. Following the Revolution scholastic 
philosophy was gradually dropped from the college curriculum. This 
trend is clearly discernible by about 1810. It is pointed out that with
this change the acquisition of information “took the place to a great
extent of training in thoughtfulness and in discrimination of truth from 
falsity.” The author apparently considers the shift in emphasis to be 
unfortunate. 

This well-written and interesting treatise is a real contribution to 
the history of education in America. Some of the author’s views, how-
ever, will be challenged. Although the argument for the liberal education
of scholasticism is thought-provoking, the contributions that such a
viewpoint might make to contemporary education must be evaluated 
largely in terms of present-day needs and cultural patterns which are 
markedly different from those of Colonial days. 

Miles A. Tinker.
University of Minnesota.

















